_extract_representative_docs samples with replace=True, producing duplicate representative documents #2491

Description

@pidefrem

Describe the bug

_extract_representative_docs calls sample(nr_samples, replace=True) on each topic's document pool. Because sampling is with replacement, the same document can be drawn more than once, and when a topic has fewer than nr_samples documents (common for small topics) repeats are unavoidable. These duplicates are then fed into the c-TF-IDF similarity calculation, inflating scores and producing duplicate entries in representative_docs_.

The existing .drop_duplicates() runs after .groupby("Topic").sample(...), so it only removes exact duplicate rows across the entire result. It does not prevent replace=True from drawing the same document multiple times within a single topic's sample.
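To make the sampling step concrete, here is a minimal sketch on a toy DataFrame. It only illustrates how groupby(...).sample(..., replace=True) behaves in isolation; it is not BERTopic's internal code. The column names, the value of nr_samples, and the sample_unique helper are assumptions made for illustration, and the without-replacement variant is one possible approach, not necessarily the fix in my fork.

```python
import pandas as pd

# Toy stand-in for the per-document table (column names are assumptions).
df = pd.DataFrame({
    "Document": ["a", "b", "c", "d", "e", "f", "g"],
    "Topic": [0, 0, 1, 1, 1, 1, 1],
})
nr_samples = 5  # illustrative value

# Sampling with replacement: topic 0 has only 2 documents,
# so 5 draws from it must repeat at least one of them.
with_replacement = df.groupby("Topic").sample(
    n=nr_samples, replace=True, random_state=42
)
repeats = with_replacement.groupby("Topic")["Document"].agg(["size", "nunique"])
print(repeats)


def sample_unique(group: pd.DataFrame, n: int = nr_samples) -> pd.DataFrame:
    """Hypothetical helper: draw at most n rows without replacement."""
    return group.sample(n=min(len(group), n), replace=False, random_state=42)


# Alternative: cap the sample size at the group size so no row repeats.
without_replacement = pd.concat(
    [sample_unique(group) for _, group in df.groupby("Topic")]
)
counts = without_replacement.groupby("Topic")["Document"].agg(["size", "nunique"])
print(counts)
```

In the first table, topic 0 has five sampled rows but at most two distinct documents. In the second, every topic has as many distinct documents as rows, and small topics simply contribute fewer rows.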

Reproduction

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"][:500]
topic_model = BERTopic(min_topic_size=5)
topics, _ = topic_model.fit_transform(docs)

# Check for duplicate representative docs within the same topic
for topic_id, docs_list in topic_model.representative_docs_.items():
    if len(docs_list) != len(set(docs_list)):
        print(f"Topic {topic_id}: {len(docs_list)} docs, {len(set(docs_list))} unique")

BERTopic Version

0.17.4

Your contribution

I've already worked through a fix for this in my fork, with tests. Happy to open a PR if this looks like the right approach; just let me know.
