Describe the bug
_extract_representative_docs calls sample(nr_samples, replace=True) on each topic's document pool. When a topic has fewer than nr_samples unique documents (common for small topics), the same document can be drawn multiple times. These duplicates are then fed into the c-TF-IDF similarity calculation, inflating scores and producing duplicate entries in representative_docs_.
The existing .drop_duplicates() runs after .groupby("Topic").sample(...), so it only removes exact duplicate rows from the combined result. It does not prevent replace=True from drawing the same document multiple times within a single topic's sample.
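The sampling behaviour described above can be seen in isolation with a small pandas sketch. The toy frame and its column names (Document, ID, Topic) are assumptions for illustration and are not BERTopic's internal data. The shuffle-then-head variant is only one possible way to sample without replacement, not necessarily the fix proposed in this issue.

```python
import pandas as pd

# Toy stand-in for a per-document frame; column names are assumptions.
documents = pd.DataFrame(
    {
        "Document": ["a", "b", "c", "d", "e"],
        "ID": [0, 1, 2, 3, 4],
        "Topic": [0, 0, 0, 1, 1],
    }
)
nr_samples = 4  # deliberately larger than either topic's document pool

# Sampling with replacement: every topic yields exactly nr_samples rows,
# so a topic with fewer documents than nr_samples must repeat some of them.
with_replacement = documents.groupby("Topic").sample(
    n=nr_samples, replace=True, random_state=42
)

# Alternative: shuffle once, then keep at most nr_samples rows per topic.
# No row can appear twice, and small topics simply contribute fewer rows.
without_replacement = (
    documents.sample(frac=1, random_state=42).groupby("Topic").head(nr_samples)
)

print(with_replacement.groupby("Topic").size().to_dict())
print(without_replacement.groupby("Topic").size().to_dict())
```

In this toy frame, topic 0 has three documents and topic 1 has two, so drawing four rows per topic with replacement necessarily repeats at least one ID in each topic, while the capped variant returns each document at most once.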
Reproduction
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"][:500]
topic_model = BERTopic(min_topic_size=5)
topics, _ = topic_model.fit_transform(docs)
# Check for duplicate representative docs within the same topic
for topic_id, docs_list in topic_model.representative_docs_.items():
    if len(docs_list) != len(set(docs_list)):
        print(f"Topic {topic_id}: {len(docs_list)} docs, {len(set(docs_list))} unique")
BERTopic Version
0.17.4
Your contribution
I've already worked through a fix for this in my fork, with tests. Happy to open a PR if this looks like the right approach — just let me know.