BreCol: Benchmark for Microbiome-Based Cancer Detection

BreCol is a curated benchmark of 2,040 16S rRNA sequencing runs across 26 studies spanning breast cancer, colorectal cancer, and healthy cohorts. Its central question: do microbiome-based cancer classifiers generalize to studies they have never seen?

Overview

Machine learning models trained on fecal microbiome profiles have shown promise for distinguishing cancer patients from healthy controls. But a persistent problem undermines many published results: performance estimates are inflated when test samples are drawn from the same studies used for training. Cross-study differences in sequencing protocol, primer choice, and regional microbiome variation all limit how well such models generalize.

BreCol addresses this by structuring evaluation around a temporal holdout. Twenty-six studies are divided chronologically into a development partition and a holdout partition. Models are trained and tuned on development studies and evaluated on holdout studies they have never seen, reflecting how a model would actually perform on future data.

We benchmark two families of models, which differ in how each sequencing run is represented:

Classical ML — k-nearest neighbors, random forest, and SVM applied to either run-level tetramer frequencies or cluster abundance profiles (UC/CAP, described below).
Deep learning — two pre-trained genome language models, HyenaDNA and SetBERT, fine-tuned for cancer classification.

Classical models reach test/holdout AUCs of 0.77/0.60 for cancer diagnosis (cancer vs. healthy) and 1.00/0.83 for cancer type (breast vs. colorectal). Both deep learning models underperform the best classical methods on holdout data. Our reference-free UC/CAP feature method achieves the best overall holdout performance without relying on taxonomic databases.

Study Design and Methods

Temporal holdout split

Within each cancer type, the 13 studies are split by publication year: the seven earliest studies (published before 2023) form the development partition, and the six most recent studies (2023 onward) form the holdout partition. Development runs are further divided 70/10/20 into training, validation, and test splits. Holdout studies are never seen during training or hyperparameter selection.

This design creates a realistic challenge: predictions must transfer to datasets that became available only after the model was trained, eliminating the shortcut of learning study-level technical signals instead of cancer biology.
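The split logic can be illustrated with a minimal sketch. It assumes a plain random shuffle of development runs with a fixed seed; the function names, the `seed` parameter, and the placeholder run IDs are hypothetical, and the repository's scripts may split differently (for example, stratified by label or study).

```python
import random

def partition_by_year(studies, holdout_from=2023):
    """Assign each study to 'development' or 'holdout' by publication year."""
    return {
        ref: ("holdout" if year >= holdout_from else "development")
        for ref, year in studies.items()
    }

def split_development_runs(run_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle development runs and cut them into train/validation/test."""
    runs = sorted(run_ids)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(runs)
    n_train = round(fractions[0] * len(runs))
    n_val = round(fractions[1] * len(runs))
    return {
        "train": runs[:n_train],
        "validation": runs[n_train:n_train + n_val],
        "test": runs[n_train + n_val:],   # remainder, about 20%
    }

# Years taken from the study list; only a few studies are shown here.
studies = {"AAM+13": 2013, "ZZZ+22": 2022, "SKC+23": 2023, "GYX+25": 2025}
partitions = partition_by_year(studies)

dev_runs = [f"RUN{i:03d}" for i in range(100)]   # placeholder run IDs
splits = split_development_runs(dev_runs)
```

Holdout runs never enter `split_development_runs`, which mirrors the rule that holdout studies are kept out of training and tuning.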

Classification tasks

Cancer diagnosis — cancer vs. healthy, using all samples.
Cancer type — breast vs. colorectal, using cancer-positive samples only.

Because breast and colorectal samples almost always come from different studies, a model can achieve near-perfect in-study accuracy on cancer type prediction by learning study identity rather than disease. Holdout evaluation removes this shortcut.
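As a rough illustration, the two tasks can be derived from the normalized `sample_label` values used in the per-study CSV files. The integer encodings and function names below are assumptions made for this sketch, not the repository's actual conventions.

```python
CANCER_LABELS = {"breast_cancer", "colorectal_cancer"}

def diagnosis_target(sample_label):
    """Cancer diagnosis: 1 for either cancer, 0 for healthy, None if excluded."""
    if sample_label in CANCER_LABELS:
        return 1
    if sample_label == "healthy":
        return 0
    return None   # e.g. benign samples are not modeled

def cancer_type_target(sample_label):
    """Cancer type: 0 for breast, 1 for colorectal, None for non-cancer samples."""
    return {"breast_cancer": 0, "colorectal_cancer": 1}.get(sample_label)

labels = ["healthy", "breast_cancer", "colorectal_cancer", "benign"]
diagnosis = [diagnosis_target(x) for x in labels]
cancer_type = [cancer_type_target(x) for x in labels]
```

Samples that map to `None` would simply be dropped from the corresponding task.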

Feature representations

Run-level tetramer frequencies. Tetranucleotide (4-mer) counts are summed across all sequences in a run and converted to relative frequencies, producing a single 256-dimensional vector per run. These frequencies are reference-free: they do not depend on curated taxonomic databases. However, pooling across sequences discards within-run compositional structure.
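A minimal sketch of the run-level computation follows. It assumes overlapping 4-mer windows, skips windows containing ambiguous bases, and does not merge reverse complements; the source does not state how the pipeline handles these details, so treat them as illustrative choices.

```python
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]   # 256 k-mers
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}

def tetramer_counts(seq):
    """Count overlapping 4-mers in one sequence; skip windows with non-ACGT bases."""
    counts = [0] * 256
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    return counts

def run_level_frequencies(sequences):
    """Sum 4-mer counts over all sequences in a run, then normalize to sum to 1."""
    total = [0] * 256
    for seq in sequences:
        for i, c in enumerate(tetramer_counts(seq)):
            total[i] += c
    n = sum(total)
    return [c / n for c in total] if n else [0.0] * 256

run = ["ACGTACGT", "ACGTNACGT"]     # toy reads; the second has one ambiguous base
freqs = run_level_frequencies(run)
```

In this toy run, seven valid windows are counted and four of them are `ACGT`, so that entry of `freqs` is 4/7.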

Unsupervised clustering / cluster abundance profiles (UC/CAP). To recover within-run structure, sequences are clustered by tetramer composition using k-means (fit only on training-split sequences). Each run is then represented by the distribution of its sequences across clusters: a cluster abundance profile (CAP). This is conceptually analogous to OTU-based methods but entirely reference-free.
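The idea can be sketched in plain Python. This is a toy version under stated assumptions: it uses 2-dimensional points in place of 256-dimensional per-sequence tetramer vectors, a basic Lloyd's k-means, and a deterministic farthest-first initialization chosen only to keep the example reproducible. The pipeline's clustering settings are not described here, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the closest centroid."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def init_farthest_first(points, k):
    """Deterministic seeding: start at the first point, then add the farthest ones."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids

def fit_kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns k centroids."""
    centroids = init_farthest_first(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, members in enumerate(groups):
            if members:   # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def cluster_abundance_profile(run_vectors, centroids):
    """Fraction of a run's sequences assigned to each cluster."""
    counts = [0] * len(centroids)
    for v in run_vectors:
        counts[nearest(v, centroids)] += 1
    return [c / len(run_vectors) for c in counts]

# Centroids are fit on training-split sequences only.
train_vectors = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
                 (1.0, 0.9), (0.9, 1.0), (1.0, 1.0)]
centroids = fit_kmeans(train_vectors, k=2)

# Any other run is then projected onto the fixed centroids.
other_run = [(0.05, 0.05), (0.95, 0.95), (1.0, 0.95), (0.9, 0.9)]
cap = cluster_abundance_profile(other_run, centroids)
```

Because the centroids are frozen after fitting, validation, test, and holdout runs cannot influence the cluster definitions.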

HyenaDNA. A long-range genomic sequence model pre-trained on the human reference genome. Sequences from each run are packed into context windows and the backbone hidden states are mean-pooled across token positions to produce a run-level embedding for classification.

SetBERT. A transformer pre-trained on ca. 280k microbial 16S rRNA samples with a relative-abundance prediction objective. Each read is encoded by a DNABERT encoder; a stack of Set Attention Blocks (SABs) contextualizes the reads from a single run and produces a [CLS] embedding summarizing the run.

For both deep learning models, three classification heads were tested: linear, MLP, and cosine similarity.
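For intuition, mean pooling and a cosine-similarity head can be written in a few lines of plain Python. This is a conceptual sketch only: the actual models operate on tensors with learned weights, and the padding mask, the `scale` factor, and the class weight vectors below are assumptions introduced for illustration.

```python
import math

def mean_pool(hidden_states, mask):
    """Average token vectors where mask is 1, ignoring padding positions."""
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, mask) if m]
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cosine_head(embedding, class_weights, scale=10.0):
    """Logit per class = scale * cosine(embedding, class weight vector)."""
    return [scale * cosine(embedding, w) for w in class_weights]

hidden = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]   # last token is padding
mask = [1, 1, 0]
embedding = mean_pool(hidden, mask)
logits = cosine_head(embedding, [[1.0, 0.0], [0.0, 1.0]])
```

A cosine head depends only on the direction of the embedding, not its magnitude, which is one reason it is sometimes tried alongside linear and MLP heads.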

Dataset Compilation

A multi-study benchmark dataset was curated for the BreCol project. The data/breast/ and data/colorectal/ directories hold one CSV file per study, named by first-author initials and year. The datasets.csv file at the repository root records the partition assignment for each study (development or holdout).

Study list

Several studies have substantially larger sample counts than others. We used stratified subsampling (by cancer/healthy label) to improve study balance across the benchmark. Sample counts reflect sizes after subsampling.

Ref	Year	BioProject	Type	Cancer	Healthy	Partition	Country
AAM+13	2013	PRJNA396901	breast	29	32	development	United States
GJH+15	2015	PRJNA345373	breast	47	47	development	United States
GHB+18	2018	PRJNA383849	breast	48	48	development	United States
BVW+21	2021	PRJNA658160	breast	57	63	development	Ghana
BSR+22	2022	PRJEB54599	breast	19	14	development	United States
WZK+22	2022	PRJNA804967	breast	54	25	development	China
ZZZ+22	2022	PRJNA726050	breast	14	14	development	China
SKC+23	2023	PRJNA872152	breast	22	21	holdout	United States
LBA+25	2025	PRJNA1127492	breast	76	16	holdout	Spain
SYL+25	2025	PRJNA1243283	breast	10	10	holdout	China
MTK+26	2026	PRJNA914483	breast	32	32	holdout	Malaysia
SVK+26	2026	PRJNA1356467	breast	22	30	holdout	India
YTK+26	2026	PRJNA1190698	breast	15	15	holdout	Turkey
ZTV+14	2014	PRJEB6070	colorectal	41	75	development	France
BRRS16	2016	PRJNA290926	colorectal	64	94	development	United States/Canada
OKN+21	2021	PRJDB11246	colorectal	67	51	development	Japan
YDS+21	2021	PRJNA763023	colorectal	65	43	development	China
YWS+21	2021	PRJEB36789	colorectal	53	52	development	Argentina/Chile/India/Vietnam
DLT+22	2022	PRJNA824020	colorectal	27	33	development	China
PCL+22	2022	PRJNA662014	colorectal	36	25	development	Singapore
BWY+23	2023	PRJEB53415	colorectal	46	43	holdout	India
BRR+24	2024	PRJEB71787	colorectal	51	51	holdout	Spain
CAB+24	2024	PRJNA911189	colorectal	90	30	holdout	Spain
SGH+24	2024	PRJNA1059759	colorectal	10	10	holdout	India
ARF+25	2025	PRJEB76625	colorectal	25	15	holdout	Iran
GYX+25	2025	PRJNA1092526, PRJNA1092376	colorectal	67	64	holdout	China
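The stratified subsampling mentioned before the table could look roughly like the sketch below. The size cap `max_runs`, the seed, and the proportional-allocation rule are all assumptions; the criteria actually used to pick which studies to subsample, and to what size, are not stated here.

```python
import random
from collections import defaultdict

def stratified_subsample(runs, max_runs, seed=0):
    """Downsample (run_id, label) pairs to about max_runs, keeping label proportions."""
    if len(runs) <= max_runs:
        return list(runs)
    by_label = defaultdict(list)
    for run_id, label in runs:
        by_label[label].append((run_id, label))
    rng = random.Random(seed)
    keep_fraction = max_runs / len(runs)
    sampled = []
    for label in sorted(by_label):          # fixed order for reproducibility
        group = by_label[label]
        n_keep = max(1, round(keep_fraction * len(group)))
        sampled.extend(rng.sample(group, n_keep))
    return sampled

# Toy study with 200 cancer and 50 healthy runs (placeholder IDs).
runs = [(f"C{i}", "cancer") for i in range(200)] + \
       [(f"H{i}", "healthy") for i in range(50)]
subset = stratified_subsample(runs, max_runs=100)
```

Sampling within each label keeps the cancer/healthy ratio of a large study roughly unchanged while reducing its weight in the benchmark.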

Per-study CSV format

Each CSV file contains one row per sequencing run. Key columns:

Run — SRA Run accession (SRR, ERR, or DRR prefix)
BioSample — SRA BioSample accession
sample_label — normalized label: healthy, breast_cancer, colorectal_cancer, or benign
sample_used — Boolean; TRUE for runs included in the analysis

Benign samples (adenomas, benign colon polyps, DCIS) and non-fecal samples are excluded from all modeling (sample_used = FALSE). Other study-specific columns (e.g. cohort) vary by file.
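A short sketch of reading one per-study file and keeping only the runs used for modeling is shown below. It relies on the column names listed above; the rows are made-up placeholders, and the comparison against the string `TRUE` assumes the Boolean is stored as text.

```python
import csv
import io
from collections import Counter

def load_used_runs(csv_file):
    """Return (Run, sample_label) for rows with sample_used == TRUE."""
    reader = csv.DictReader(csv_file)
    return [
        (row["Run"], row["sample_label"])
        for row in reader
        if row["sample_used"].strip().upper() == "TRUE"
    ]

# Made-up rows in the documented column layout (accessions are placeholders).
example = io.StringIO(
    "Run,BioSample,sample_label,sample_used\n"
    "SRR0000001,SAMN00000001,healthy,TRUE\n"
    "SRR0000002,SAMN00000002,colorectal_cancer,TRUE\n"
    "SRR0000003,SAMN00000003,benign,FALSE\n"
)
used = load_used_runs(example)
label_counts = Counter(label for _, label in used)
```

For a real file, pass a handle opened with `open(path, newline="")`, where `path` points to one of the CSV files under data/breast/ or data/colorectal/.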

Data Analysis Pipeline

Install dependencies first, then run pipeline steps in order.

Installation

pip install -r requirements.txt

This installs all Python dependencies, including the local hyenadna and setbert packages in editable mode.

Pipeline steps

make download_data          # Download 16S sequences from SRA (~13 GB)
make -j4 tetramer_cache     # Build hive-partitioned Parquet cache (~16 min, 1.5 GB)
make tetramer_frequencies   # Sum cached counts per run (~2 min)
make -j4 fit_tetramer EXPT=0       # Train tetramer classifiers (~1 min)
make run_uc_cap FEAT=0             # Build cluster abundance profiles (~29 min, 13 GB RAM)
make -j4 fit_uc_cap FEAT=0 EXPT=0  # Train UC/CAP classifiers (~13 min)
make hyenadna_run_tensors          # Build HyenaDNA input tensors (~12 min, 2.5 GB)
make train_hyenadna EXPT=0         # Fine-tune HyenaDNA (~6 hr)
make setbert_run_tensors           # Build SetBERT input tensors (~43 min, 2.3 GB)
make train_setbert EXPT=0          # Fine-tune SetBERT (~17 hr)

After running the above, regenerate manuscript tables and figures:

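for f in helpers/table*.py; do python "$f"; done    # run each table script in turn
for f in helpers/figure*.py; do python "$f"; done   # run each figure script in turn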

Notes on EXPT and FEAT: EXPT=0 runs all experiments listed in experiments.yaml for a target; FEAT=0 builds all UC/CAP feature sets. Use EXPT=N or FEAT=N to run a single entry. Non-GPU steps support -j for parallel execution. make fit_uc_cap requires both FEAT and EXPT to be specified when sweeping over one of them.

Debugging Make targets: make explain-<target> prints why Make would rebuild a target and its full prerequisite chain. For example: make explain-run_uc_cap FEAT=0.

Hardware: The full pipeline runs in approximately 25 hours on a machine with 8 CPU cores, a 16 GB NVIDIA GPU, and 32 GB of RAM.

Code Details

HyenaDNA

Source files are under hyenadna/. The package is installed by requirements.txt, or manually from the repo root:

pip install -e hyenadna

standalone_hyenadna.py — downloaded from HazyResearch/hyena-dna
huggingface_wrapper.py and inference_example.py — extracted from the HyenaDNA Colab Notebook

Local modifications are summarized in the comments within each file. To verify the installation: cd hyenadna; python -c 'import inference_example as ex; ex.inference_single()'

Pre-trained checkpoint: hyenadna-small-32k-seqlen, cloned from Hugging Face on first use if not already present under paths.checkpoint_dir.

SetBERT

Source files are under setbert/. The package was repackaged from three upstream sources:

Local modifications:

Use SDPA (scaled dot-product attention) for faster multi-head attention, including the relative position bias in RelativeMultiHeadAttention.
Set use_reentrant=False on the activation-checkpoint call so gradients reach the DNABERT encoder.
Match the destination embedding dtype to the encoder output to avoid bf16/fp32 conversions under AMP.

Pre-trained checkpoint: qiita-16s (12-layer DNABERT encoder, 768-dimensional embeddings, 6-layer SAB transformer). Note: the SetBERT paper describes a 64-dim model that is not published on Hugging Face.

Repository Layout

data/               Per-study CSV files (breast/ and colorectal/)
datasets.csv        Study list with partition assignments (development/holdout)
defaults.yaml       Default pipeline parameters
experiments.yaml    Experiment configurations for Make targets
scripts/            Analysis scripts called by Make targets
helpers/            Scripts for generating manuscript tables and figures
manuscript/         Manuscript source, figures, and tables [*]
hyenadna/           Local HyenaDNA package
setbert/            Local SetBERT package
Makefile            Pipeline entry point
requirements.txt    Python dependencies

[*] Note: manuscript.lyx is the live version of the manuscript that is being hand-edited in LyX. manuscript.md is an abandoned version that is kept here to retain edit history (including human and AI edits). make manuscript_pdf creates a PDF file from the outdated manuscript.md; refer to the LyX file instead for the latest changes.

Citation

If you use BreCol data or code, please cite the accompanying manuscript (forthcoming).
Raw sequencing data are available at the SRA accessions listed in the study table above; those studies should be cited when using their data.
