BreCol is a curated benchmark of 2,040 16S rRNA sequencing runs across 26 studies spanning breast cancer, colorectal cancer, and healthy cohorts. Its central question: do microbiome-based cancer classifiers generalize to studies they have never seen?
Machine learning models trained on fecal microbiome profiles have shown promise for distinguishing cancer patients from healthy controls. But a persistent problem undermines many published results: performance estimates are inflated when test samples are drawn from the same studies used for training. Cross-study differences in sequencing protocol, primer choice, and regional microbiome variation all limit how well such models generalize.
BreCol addresses this by structuring evaluation around a temporal holdout. Twenty-six studies are divided chronologically into a development partition and a holdout partition. Models are trained and tuned on development studies and evaluated on holdout studies they have never seen, reflecting how a model would actually perform on future data.
We benchmark two approaches to feature representation:
- Classical ML — k-nearest neighbors, random forest, and SVM applied to either run-level tetramer frequencies or cluster abundance profiles (UC/CAP, described below).
- Deep learning — two pre-trained genome language models, HyenaDNA and SetBERT, fine-tuned for cancer classification.
Classical models reach test/holdout AUCs of 0.77/0.60 for cancer diagnosis (cancer vs. healthy) and 1.00/0.83 for cancer type (breast vs. colorectal). Both deep learning models underperform the best classical methods on holdout data. Our reference-free UC/CAP feature method achieves the best overall holdout performance without relying on taxonomic databases.
Within each cancer type, the 13 studies are split by publication year into two partitions: the seven earliest studies (pre-2023) form the development partition, and the six most recent studies (2023 onward) form the holdout partition. Development runs are further divided 70/10/20 into training, validation, and test splits. Holdout studies are never seen during training or hyperparameter selection.
This design creates a realistic challenge: predictions must transfer to datasets that became available only after the model was trained, eliminating the shortcut of learning study-level technical signals instead of cancer biology.
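The partitioning logic can be sketched in a few lines of Python. This is an illustration rather than the repository's code: the function names, the fixed seed, and the choice to shuffle at the run level without stratification are all assumptions; only the 2023 cutoff and the 70/10/20 proportions come from the description above.

```python
import random

def assign_partition(year, cutoff=2023):
    """Studies published before the cutoff are development; the rest are holdout."""
    return "development" if year < cutoff else "holdout"

def split_development_runs(run_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle development runs and cut them into train/validation/test.

    Assumption: a plain run-level shuffle; the actual pipeline may stratify.
    """
    runs = list(run_ids)
    random.Random(seed).shuffle(runs)
    n_train = round(fractions[0] * len(runs))
    n_val = round(fractions[1] * len(runs))
    return {
        "train": runs[:n_train],
        "validation": runs[n_train:n_train + n_val],
        "test": runs[n_train + n_val:],
    }

# Publication years copied from four breast cancer rows of the study table.
breast_years = {"AAM+13": 2013, "ZZZ+22": 2022, "SKC+23": 2023, "YTK+26": 2026}
partitions = {ref: assign_partition(year) for ref, year in breast_years.items()}

# Hypothetical run identifiers, used only to show the split sizes.
splits = split_development_runs([f"run{i}" for i in range(100)])
```

Only development runs pass through `split_development_runs`; holdout runs stay untouched until final evaluation.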
- Cancer diagnosis — cancer vs. healthy, using all samples.
- Cancer type — breast vs. colorectal, using cancer-positive samples only.
Because breast and colorectal samples almost always come from different studies, a model can achieve near-perfect in-study accuracy on cancer type prediction by learning study identity rather than disease. Holdout evaluation removes this shortcut.
Run-level tetramer frequencies. All 4-mer counts are summed across the sequences in a run and converted to relative frequencies, producing a single 256-dimensional vector per run. Tetranucleotide (4-mer) frequencies are reference-free, meaning that they avoid any dependence on curated taxonomic databases. However, averaging across sequences discards within-run compositional structure.
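A minimal sketch of the run-level computation follows. It assumes overlapping windows, single-strand counting, and that windows containing ambiguous bases such as N are skipped; the repository may handle these details differently, and the function name is illustrative.

```python
from itertools import product

# Fixed ordering of all 4^4 = 256 tetramers over the DNA alphabet.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}

def run_tetramer_frequencies(sequences):
    """Sum 4-mer counts over all sequences in a run, then normalize to frequencies."""
    counts = [0] * len(TETRAMERS)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 3):
            j = INDEX.get(seq[i:i + 4])  # None for windows with N or other symbols
            if j is not None:
                counts[j] += 1
    total = sum(counts)
    # Assumption: a run with no valid tetramers maps to the zero vector.
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Two toy reads standing in for the sequences of one run.
freqs = run_tetramer_frequencies(["ACGTACGT", "ACGTNACG"])
```

Because counts are pooled before normalizing, longer reads contribute more windows than shorter ones, and the per-sequence composition is no longer recoverable from the result.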
Unsupervised clustering / cluster abundance profiles (UC/CAP). To recover within-run structure, sequences are clustered by tetramer composition using k-means (fit only on training-split sequences). Each run is then represented by the distribution of its sequences across clusters: a cluster abundance profile (CAP). This is conceptually analogous to OTU-based methods but entirely reference-free.
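The idea can be illustrated with a small NumPy sketch. The k-means here is a plain Lloyd's iteration written out for self-containment, not the implementation the pipeline uses, and the toy data, the cluster count, and all function names are assumptions.

```python
import numpy as np

def assign_clusters(X, centroids):
    """Index of the nearest centroid (Euclidean) for each row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def fit_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign_clusters(X, centroids)
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids

def cluster_abundance_profile(run_X, centroids):
    """Fraction of a run's sequences assigned to each cluster (non-empty run)."""
    labels = assign_clusters(run_X, centroids)
    counts = np.bincount(labels, minlength=len(centroids))
    return counts / counts.sum()

# Toy data: each row stands in for one sequence's tetramer frequency vector.
rng = np.random.default_rng(1)
train_X = np.vstack([rng.normal(0, 0.1, (50, 4)), rng.normal(3, 0.1, (50, 4))])
centroids = fit_kmeans(train_X, k=2)  # fit on training-split sequences only
new_run = np.vstack([rng.normal(0, 0.1, (30, 4)), rng.normal(3, 0.1, (10, 4))])
cap = cluster_abundance_profile(new_run, centroids)
```

The centroids are fit once and then reused for validation, test, and holdout runs, so no information from those runs shapes the feature space.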
HyenaDNA. A long-range genomic sequence model pre-trained on the human reference genome. Sequences from each run are packed into context windows and the backbone hidden states are mean-pooled across token positions to produce a run-level embedding for classification.
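The pooling step can be sketched with NumPy arrays standing in for backbone outputs. The array shapes, the padding mask, and the decision to average over all windows of a run at once are assumptions made for illustration; the source states only that hidden states are mean-pooled across token positions.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average hidden states over non-padding token positions.

    hidden: (n_windows, n_tokens, d) backbone outputs for one run
    mask:   (n_windows, n_tokens), 1 for real tokens and 0 for padding
    """
    weights = mask.astype(hidden.dtype)[..., None]
    return (hidden * weights).sum(axis=(0, 1)) / weights.sum()

# Toy hidden states: 2 windows, 3 token positions, 4 hidden dimensions.
hidden = np.arange(24, dtype=float).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 0, 0]])
embedding = mean_pool(hidden, mask)  # one d-dimensional vector for the run
```

Masking keeps padded positions from diluting the average when windows are only partially filled.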
SetBERT. A transformer pre-trained on ca. 280k microbial 16S rRNA samples with a relative-abundance prediction objective. Each read is encoded by a DNABERT encoder; a stack of Set Attention Blocks (SABs) contextualizes the reads from a single run and produces a [CLS] embedding summarizing the run.
For both deep learning models, three classification heads were tested: linear, MLP, and cosine similarity.
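As an illustration of the least familiar of the three, one common form of a cosine similarity head scores each run embedding against a learned vector per class. The sketch below uses NumPy; the scale factor and the per-class weight vectors are assumptions, not details confirmed by the repository.

```python
import numpy as np

def cosine_head(embeddings, class_weights, scale=10.0):
    """Logits as scaled cosine similarity between embeddings and class vectors."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    return scale * e @ w.T

# Toy run embeddings (3 runs, 2 dimensions) and one weight vector per class.
embeddings = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
class_weights = np.array([[1.0, 0.0], [0.0, 1.0]])
logits = cosine_head(embeddings, class_weights)
pred = logits.argmax(axis=1)
```

Unlike a linear head, the output depends only on the direction of the embedding, not its norm, which can help when embedding magnitudes differ between studies.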
A multi-study benchmark dataset was curated for the BreCol project.
The data/breast/ and data/colorectal/ directories hold one CSV file per study,
named by first-author initials and year. The datasets.csv file at the
repository root records the partition assignment for each study (development or holdout).
Several studies have substantially larger sample counts than others. We used stratified subsampling (by cancer/healthy label) to improve study balance across the benchmark. Sample counts reflect sizes after subsampling.
| Ref | Year | BioProject | Type | Cancer | Healthy | Partition | Country |
|---|---|---|---|---|---|---|---|
| AAM+13 | 2013 | PRJNA396901 | breast | 29 | 32 | development | United States |
| GJH+15 | 2015 | PRJNA345373 | breast | 47 | 47 | development | United States |
| GHB+18 | 2018 | PRJNA383849 | breast | 48 | 48 | development | United States |
| BVW+21 | 2021 | PRJNA658160 | breast | 57 | 63 | development | Ghana |
| BSR+22 | 2022 | PRJEB54599 | breast | 19 | 14 | development | United States |
| WZK+22 | 2022 | PRJNA804967 | breast | 54 | 25 | development | China |
| ZZZ+22 | 2022 | PRJNA726050 | breast | 14 | 14 | development | China |
| SKC+23 | 2023 | PRJNA872152 | breast | 22 | 21 | holdout | United States |
| LBA+25 | 2025 | PRJNA1127492 | breast | 76 | 16 | holdout | Spain |
| SYL+25 | 2025 | PRJNA1243283 | breast | 10 | 10 | holdout | China |
| MTK+26 | 2026 | PRJNA914483 | breast | 32 | 32 | holdout | Malaysia |
| SVK+26 | 2026 | PRJNA1356467 | breast | 22 | 30 | holdout | India |
| YTK+26 | 2026 | PRJNA1190698 | breast | 15 | 15 | holdout | Turkey |
| ZTV+14 | 2014 | PRJEB6070 | colorectal | 41 | 75 | development | France |
| BRRS16 | 2016 | PRJNA290926 | colorectal | 64 | 94 | development | United States/Canada |
| OKN+21 | 2021 | PRJDB11246 | colorectal | 67 | 51 | development | Japan |
| YDS+21 | 2021 | PRJNA763023 | colorectal | 65 | 43 | development | China |
| YWS+21 | 2021 | PRJEB36789 | colorectal | 53 | 52 | development | Argentina/Chile/India/Vietnam |
| DLT+22 | 2022 | PRJNA824020 | colorectal | 27 | 33 | development | China |
| PCL+22 | 2022 | PRJNA662014 | colorectal | 36 | 25 | development | Singapore |
| BWY+23 | 2023 | PRJEB53415 | colorectal | 46 | 43 | holdout | India |
| BRR+24 | 2024 | PRJEB71787 | colorectal | 51 | 51 | holdout | Spain |
| CAB+24 | 2024 | PRJNA911189 | colorectal | 90 | 30 | holdout | Spain |
| SGH+24 | 2024 | PRJNA1059759 | colorectal | 10 | 10 | holdout | India |
| ARF+25 | 2025 | PRJEB76625 | colorectal | 25 | 15 | holdout | Iran |
| GYX+25 | 2025 | PRJNA1092526, PRJNA1092376 | colorectal | 67 | 64 | holdout | China |
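The stratified subsampling mentioned before the table could look roughly like the following. The target size `max_runs`, the seed, and the function name are hypothetical; the source does not state the cap that was applied to large studies.

```python
import random
from collections import defaultdict

def stratified_subsample(rows, max_runs, seed=0):
    """Downsample rows to about max_runs while keeping label proportions."""
    if len(rows) <= max_runs:
        return list(rows)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["sample_label"]].append(row)
    rng = random.Random(seed)
    kept = []
    for group in by_label.values():
        # Each label keeps a share of max_runs proportional to its frequency.
        n_keep = round(max_runs * len(group) / len(rows))
        kept.extend(rng.sample(group, n_keep))
    return kept

# Made-up study with 50 cancer and 150 healthy runs.
rows = [
    {"Run": f"SRR{i}", "sample_label": "healthy" if i % 4 else "colorectal_cancer"}
    for i in range(200)
]
subset = stratified_subsample(rows, max_runs=100)
```

Because of rounding, the result can differ from `max_runs` by a run or two when label proportions do not divide evenly.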
Each CSV file contains one row per sequencing run. Key columns:
- Run — SRA Run accession (SRR, ERR, or DRR prefix)
- BioSample — SRA BioSample accession
- sample_label — normalized label: healthy, breast_cancer, colorectal_cancer, or benign
- sample_used — Boolean; TRUE for runs included in the analysis
Benign samples (adenomas, benign colon polyps, DCIS) and non-fecal samples are
excluded from all modeling (sample_used = FALSE). Other study-specific columns
(e.g. cohort) vary by file.
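A short sketch of reading one per-study file and keeping only the analyzed runs, using the standard library. The in-memory CSV and its accessions are made-up placeholders, and the function name is illustrative; only the column names come from the description above.

```python
import csv
import io

def load_used_runs(handle):
    """Return rows flagged sample_used = TRUE from one per-study CSV."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row["sample_used"].strip().upper() == "TRUE"]

# Made-up rows with the documented key columns; accessions are placeholders.
example = io.StringIO(
    "Run,BioSample,sample_label,sample_used\n"
    "SRR0000001,SAMN00000001,healthy,TRUE\n"
    "SRR0000002,SAMN00000002,colorectal_cancer,TRUE\n"
    "SRR0000003,SAMN00000003,benign,FALSE\n"
)
used = load_used_runs(example)
labels = [row["sample_label"] for row in used]
```

For a real study, pass a file handle opened with `newline=""` on one of the CSV files under data/breast/ or data/colorectal/ instead of the `StringIO` object.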
Install dependencies first, then run pipeline steps in order.
pip install -r requirements.txt
This installs all Python dependencies, including the local hyenadna and setbert
packages in editable mode.
make download_data # Download 16S sequences from SRA (~13 GB)
make -j4 tetramer_cache # Build hive-partitioned Parquet cache (~16 min, 1.5 GB)
make tetramer_frequencies # Sum cached counts per run (~2 min)
make -j4 fit_tetramer EXPT=0 # Train tetramer classifiers (~1 min)
make run_uc_cap FEAT=0 # Build cluster abundance profiles (~29 min, 13 GB RAM)
make -j4 fit_uc_cap FEAT=0 EXPT=0 # Train UC/CAP classifiers (~13 min)
make hyenadna_run_tensors # Build HyenaDNA input tensors (~12 min, 2.5 GB)
make train_hyenadna EXPT=0 # Fine-tune HyenaDNA (~6 hr)
make setbert_run_tensors # Build SetBERT input tensors (~43 min, 2.3 GB)
make train_setbert EXPT=0 # Fine-tune SetBERT (~17 hr)
After running the above, regenerate manuscript tables and figures:
python helpers/table*.py
python helpers/figure*.py
Notes on EXPT and FEAT:
EXPT=0 runs all experiments listed in experiments.yaml for a target;
FEAT=0 builds all UC/CAP feature sets. Use EXPT=N or FEAT=N to run a single
entry. Non-GPU steps support -j for parallel execution.
make fit_uc_cap requires both FEAT and EXPT to be specified when sweeping
over one of them.
Debugging Make targets:
make explain-<target> prints why Make would rebuild a target and its full
prerequisite chain. For example: make explain-run_uc_cap FEAT=0.
Hardware: The full pipeline runs in approximately 25 hours on a machine with 8 CPU cores, a 16 GB NVIDIA GPU, and 32 GB of RAM.
Source files are under hyenadna/. The package is installed by requirements.txt,
or manually from the repo root:
pip install -e hyenadna
- standalone_hyenadna.py — downloaded from HazyResearch/hyena-dna
- huggingface_wrapper.py and inference_example.py — extracted from the HyenaDNA Colab Notebook
Local modifications are summarized in the comments within each file. To verify the
installation: cd hyenadna; python -c 'import inference_example as ex; ex.inference_single()'
Pre-trained checkpoint: hyenadna-small-32k-seqlen, cloned from Hugging Face
on first use if not already present under paths.checkpoint_dir.
Source files are under setbert/. The package was repackaged from three upstream
sources:
- deepbio-toolkit v0.4.5
- dbtk-dnabert v1.2.3
- dbtk-setbert v1.0.3
Local modifications:
- Use SDPA (scaled dot-product attention) for faster multi-head attention, including the relative position bias in RelativeMultiHeadAttention.
- Set use_reentrant=False on the activation-checkpoint call so gradients reach the DNABERT encoder.
- Match the destination embedding dtype to the encoder output to avoid bf16/fp32 conversions under AMP.
Pre-trained checkpoint:
qiita-16s (12-layer DNABERT
encoder, 768-dimensional embeddings, 6-layer SAB transformer). Note: the
SetBERT paper describes a 64-dim model that is not published on Hugging Face.
data/ Per-study CSV files (breast/ and colorectal/)
datasets.csv Study list with partition assignments (development/holdout)
defaults.yaml Default pipeline parameters
experiments.yaml Experiment configurations for Make targets
scripts/ Analysis scripts called by Make targets
helpers/ Scripts for generating manuscript tables and figures
manuscript/ Manuscript source, figures, and tables [*]
hyenadna/ Local HyenaDNA package
setbert/ Local SetBERT package
Makefile Pipeline entry point
requirements.txt Python dependencies
[*] Note: manuscript.lyx is the live version of the manuscript that is being hand-edited in LyX.
manuscript.md is an abandoned version that is kept here to retain edit history (including human and AI edits).
make manuscript_pdf creates a PDF file from the outdated manuscript.md; refer to the LyX file instead for the latest changes.
- If you use BreCol data or code, please cite the accompanying manuscript (forthcoming).
- Raw sequencing data are available at the SRA accessions listed in the study table above; those studies should be cited when using their data.