jedick/BreCol

BreCol: Benchmark for Microbiome-Based Cancer Detection

BreCol is a curated benchmark of 2,040 16S rRNA sequencing runs across 26 studies spanning breast cancer, colorectal cancer, and healthy cohorts. Its central question: do microbiome-based cancer classifiers generalize to studies they have never seen?


Overview

Machine learning models trained on fecal microbiome profiles have shown promise for distinguishing cancer patients from healthy controls. But a persistent problem undermines many published results: performance estimates are inflated when test samples are drawn from the same studies used for training. Cross-study differences in sequencing protocol and primer choice, along with regional microbiome variation, are among the factors that limit how well models generalize.

BreCol addresses this by structuring evaluation around a temporal holdout. Twenty-six studies are divided chronologically into a development partition and a holdout partition. Models are trained and tuned on development studies and evaluated on holdout studies they have never seen, reflecting how a model would actually perform on future data.

We benchmark two modeling approaches:

  • Classical ML: k-nearest neighbors, random forest, and SVM applied to either run-level tetramer frequencies or cluster abundance profiles (UC/CAP, described below).
  • Deep learning: two pre-trained genome language models, HyenaDNA and SetBERT, fine-tuned for cancer classification.

Classical models reach test/holdout AUCs of 0.77/0.60 for cancer diagnosis (cancer vs. healthy) and 1.00/0.83 for cancer type (breast vs. colorectal). Both deep learning models underperform the best classical methods on holdout data. Our reference-free UC/CAP feature method achieves the best overall holdout performance without relying on taxonomic databases.


Study Design and Methods

Temporal holdout split

The 26 studies are split by publication year into two partitions for each cancer type: the first seven studies (pre-2023) form the development partition, and the six most recent studies (2023 onward) form the holdout partition. Development runs are further divided 70/10/20 into training, validation, and test splits. Holdout studies are never seen during training or hyperparameter selection.

This design creates a realistic challenge: predictions must transfer to datasets that became available only after the model was trained, eliminating the shortcut of learning study-level technical signals instead of cancer biology.
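As a rough illustration of the split described above, the sketch below assigns studies to partitions by publication year and cuts development runs 70/10/20. It is not the repository's code: the function names, the seed, the run identifiers, and the plain shuffled split (no stratification) are assumptions; only the 2023 cutoff, the 70/10/20 proportions, and the study refs come from this README.

```python
import random

# A few (ref, publication year) pairs taken from the study table.
STUDIES = [("AAM+13", 2013), ("ZZZ+22", 2022), ("SKC+23", 2023), ("MTK+26", 2026)]

HOLDOUT_START_YEAR = 2023  # studies from 2023 onward are held out


def assign_partition(year):
    """Return 'holdout' for studies published in 2023 or later, else 'development'."""
    return "holdout" if year >= HOLDOUT_START_YEAR else "development"


def split_development_runs(run_ids, seed=0):
    """Shuffle development runs and cut them 70/10/20 into train/validation/test.

    Assumption: a simple shuffled split; the actual pipeline may stratify.
    """
    runs = sorted(run_ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(runs)
    n_train = round(0.7 * len(runs))
    n_val = round(0.1 * len(runs))
    return {
        "train": runs[:n_train],
        "validation": runs[n_train:n_train + n_val],
        "test": runs[n_train + n_val:],
    }


partitions = {ref: assign_partition(year) for ref, year in STUDIES}
# Hypothetical run identifiers standing in for SRA Run accessions.
splits = split_development_runs([f"run{i:03d}" for i in range(100)])
print(partitions)
print({name: len(ids) for name, ids in splits.items()})
```

Holdout runs never pass through split_development_runs, which mirrors the rule that holdout studies are not used for training or hyperparameter selection.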

Classification tasks

  • Cancer diagnosis — cancer vs. healthy, using all samples.
  • Cancer type — breast vs. colorectal, using cancer-positive samples only.
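The two task definitions above can be expressed as label mappings. This is a minimal sketch that assumes the sample_label values listed under "Per-study CSV format"; the 0/1 encoding and the helper names are illustrative choices, not taken from the repository.

```python
# Label strings follow the sample_label column of the per-study CSV files.
CANCER_LABELS = {"breast_cancer", "colorectal_cancer"}


def diagnosis_target(sample_label):
    """Cancer diagnosis: 1 for cancer, 0 for healthy, None for anything else."""
    if sample_label in CANCER_LABELS:
        return 1
    if sample_label == "healthy":
        return 0
    return None  # e.g. benign samples are not modeled


def cancer_type_target(sample_label):
    """Cancer type: 0 for breast, 1 for colorectal, None for non-cancer samples."""
    if sample_label == "breast_cancer":
        return 0
    if sample_label == "colorectal_cancer":
        return 1
    return None


def build_task(labels, target_fn):
    """Return (index, target) pairs for the samples that belong to a task."""
    pairs = []
    for index, label in enumerate(labels):
        target = target_fn(label)
        if target is not None:
            pairs.append((index, target))
    return pairs


labels = ["healthy", "breast_cancer", "benign", "colorectal_cancer"]
diagnosis = build_task(labels, diagnosis_target)
cancer_type = build_task(labels, cancer_type_target)
print(diagnosis)    # [(0, 0), (1, 1), (3, 1)]
print(cancer_type)  # [(1, 0), (3, 1)]
```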

Because breast and colorectal samples almost always come from different studies, a model can achieve near-perfect in-study accuracy on cancer type prediction by learning study identity rather than disease. Holdout evaluation removes this shortcut.

Feature representations

Run-level tetramer frequencies. Tetranucleotide (4-mer) counts are summed across all sequences in a run and converted to relative frequencies, producing a single 256-dimensional vector per run (4^4 = 256 possible tetramers). These frequencies are reference-free: they do not depend on curated taxonomic databases. However, pooling counts across sequences discards within-run compositional structure.
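A minimal sketch of the run-level computation, using only the standard library. Several details are assumptions rather than the pipeline's documented behavior: windows containing non-ACGT symbols are skipped, only the forward strand is counted, and tetramers are ordered lexicographically.

```python
from itertools import product

# All 256 tetramers in a fixed order so vectors are comparable across runs.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def tetramer_counts(sequence):
    """Count overlapping 4-mers in one sequence."""
    counts = [0] * 256
    seq = sequence.upper()
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])  # windows with N or other symbols are skipped
        if j is not None:
            counts[j] += 1
    return counts


def run_level_frequencies(sequences):
    """Sum counts over all sequences in a run, then normalize to frequencies."""
    totals = [0] * 256
    for seq in sequences:
        for j, c in enumerate(tetramer_counts(seq)):
            totals[j] += c
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0] * 256
    return [c / grand_total for c in totals]


# Toy run with two short reads; real reads are 16S amplicon sequences.
freqs = run_level_frequencies(["ACGTACGT", "AAAAAN"])
print(len(freqs), freqs[INDEX["ACGT"]], freqs[INDEX["AAAA"]])
```

In the toy run, seven valid windows are counted in total (five from the first read, two from the second), so ACGT and AAAA each receive a frequency of 2/7.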

Unsupervised clustering / cluster abundance profiles (UC/CAP). To recover within-run structure, sequences are clustered by tetramer composition using k-means (fit only on training-split sequences). Each run is then represented by the distribution of its sequences across clusters: a cluster abundance profile (CAP). This is conceptually analogous to OTU-based methods but entirely reference-free.
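The profile step can be sketched as follows, assuming the k-means centroids have already been fit on training-split sequences. The k-means fitting itself is omitted, and the use of squared Euclidean distance and the toy 2-dimensional vectors are assumptions for illustration; in the pipeline, each sequence would be represented by its tetramer composition.

```python
def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for i, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def cluster_abundance_profile(sequence_vectors, centroids):
    """Fraction of a run's sequences assigned to each cluster."""
    counts = [0] * len(centroids)
    for vector in sequence_vectors:
        counts[nearest_centroid(vector, centroids)] += 1
    n = len(sequence_vectors)
    if n == 0:
        return [0.0] * len(centroids)
    return [c / n for c in counts]


# Toy example: three centroids and four sequences from one run.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
run_vectors = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1], [0.1, 0.9]]
cap = cluster_abundance_profile(run_vectors, centroids)
print(cap)  # [0.5, 0.25, 0.25]
```

Because centroids come only from the training split, validation, test, and holdout runs are projected onto a fixed set of clusters, so no information from those runs shapes the feature space.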

HyenaDNA. A long-range genomic sequence model pre-trained on the human reference genome. Sequences from each run are packed into context windows and the backbone hidden states are mean-pooled across token positions to produce a run-level embedding for classification.
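Mean-pooling over token positions can be illustrated without any deep learning framework. The sketch below pools one context window; the padding mask and the function name are assumptions, and how embeddings from multiple windows of the same run are combined is not specified here.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors over positions where the mask is 1."""
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    n_tokens = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            n_tokens += 1
            for d in range(dim):
                pooled[d] += vector[d]
    if n_tokens == 0:
        return pooled
    return [value / n_tokens for value in pooled]


# One toy window: 4 token positions, hidden size 2, last position is padding.
window = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
embedding = mean_pool(window, mask)
print(embedding)  # [3.0, 4.0]
```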

SetBERT. A transformer pre-trained on ca. 280k microbial 16S rRNA samples with a relative-abundance prediction objective. Each read is encoded by a DNABERT encoder; a stack of Set Attention Blocks (SABs) contextualizes the reads from a single run and produces a [CLS] embedding summarizing the run.

For both deep learning models, three classification heads were tested: linear, MLP, and cosine similarity.
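The three head types differ only in how a run-level embedding is mapped to class scores. The following pure-Python sketch shows one common form of each. The ReLU activation, the single hidden layer, and the scale factor in the cosine head are assumptions, not details confirmed by this README.

```python
import math


def linear_head(x, weights, bias):
    """One score per class: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_head(x, w1, b1, w2, b2):
    """Hidden layer with ReLU (assumed), followed by a linear output layer."""
    hidden = [max(0.0, h) for h in linear_head(x, w1, b1)]
    return linear_head(hidden, w2, b2)


def _norm(v):
    """Euclidean norm, with 1.0 substituted for zero vectors to avoid division by zero."""
    return math.sqrt(sum(a * a for a in v)) or 1.0


def cosine_head(x, prototypes, scale=10.0):
    """Scaled cosine similarity between the embedding and one prototype per class."""
    x_norm = _norm(x)
    return [
        scale * sum(a * b for a, b in zip(x, p)) / (x_norm * _norm(p))
        for p in prototypes
    ]


x = [1.0, 2.0]
print(linear_head(x, [[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]))    # [1.5, 1.5]
print(mlp_head(x, [[1.0, -1.0]], [0.0], [[2.0], [3.0]], [0.1, 0.2]))
print(cosine_head([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]], scale=1.0))  # [1.0, 0.0]
```

The cosine head ignores the magnitude of the embedding, which is one reason it is sometimes tried when embedding norms vary across data sources.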


Dataset Compilation

A multi-study benchmark dataset was curated for the BreCol project. The data/breast/ and data/colorectal/ directories hold one CSV file per study, named by first-author initials and year. The datasets.csv file at the repository root records the partition assignment for each study (development or holdout).

Study list

Several studies have substantially larger sample counts than others. We used stratified subsampling (by cancer/healthy label) to improve study balance across the benchmark. Sample counts reflect sizes after subsampling.

Ref     Year  BioProject    Type        Cancer  Healthy  Partition    Country
AAM+13  2013  PRJNA396901   breast          29       32  development  United States
GJH+15  2015  PRJNA345373   breast          47       47  development  United States
GHB+18  2018  PRJNA383849   breast          48       48  development  United States
BVW+21  2021  PRJNA658160   breast          57       63  development  Ghana
BSR+22  2022  PRJEB54599    breast          19       14  development  United States
WZK+22  2022  PRJNA804967   breast          54       25  development  China
ZZZ+22  2022  PRJNA726050   breast          14       14  development  China
SKC+23  2023  PRJNA872152   breast          22       21  holdout      United States
LBA+25  2025  PRJNA1127492  breast          76       16  holdout      Spain
SYL+25  2025  PRJNA1243283  breast          10       10  holdout      China
MTK+26  2026  PRJNA914483   breast          32       32  holdout      Malaysia
SVK+26  2026  PRJNA1356467  breast          22       30  holdout      India
YTK+26  2026  PRJNA1190698  breast          15       15  holdout      Turkey
ZTV+14  2014  PRJEB6070     colorectal      41       75  development  France
BRRS16  2016  PRJNA290926   colorectal      64       94  development  United States/Canada
OKN+21  2021  PRJDB11246    colorectal      67       51  development  Japan
YDS+21  2021  PRJNA763023   colorectal      65       43  development  China
YWS+21  2021  PRJEB36789    colorectal      53       52  development  Argentina/Chile/India/Vietnam
DLT+22  2022  PRJNA824020   colorectal      27       33  development  China
PCL+22  2022  PRJNA662014   colorectal      36       25  development  Singapore
BWY+23  2023  PRJEB53415    colorectal      46       43  holdout      India
BRR+24  2024  PRJEB71787    colorectal      51       51  holdout      Spain
CAB+24  2024  PRJNA911189   colorectal      90       30  holdout      Spain
SGH+24  2024  PRJNA1059759  colorectal      10       10  holdout      India
ARF+25  2025  PRJEB76625    colorectal      25       15  holdout      Iran
GYX+25  2025  PRJNA1092526  colorectal      67       64  holdout      China
              PRJNA1092376
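The stratified subsampling mentioned above the table could look roughly like the sketch below. The per-study cap, the seed, and the rounding rule are hypothetical; the README states only that subsampling was stratified by cancer/healthy label.

```python
import random
from collections import defaultdict


def stratified_subsample(runs, max_runs, seed=0):
    """Keep about max_runs runs from one study, preserving label proportions.

    runs is a list of (run_id, label) pairs. Rounding per label means the
    result can differ from max_runs by a run or two.
    """
    if len(runs) <= max_runs:
        return list(runs)
    by_label = defaultdict(list)
    for run_id, label in runs:
        by_label[label].append((run_id, label))
    rng = random.Random(seed)
    fraction = max_runs / len(runs)
    kept = []
    for label in sorted(by_label):  # sorted for a reproducible draw order
        group = by_label[label]
        n_keep = max(1, round(fraction * len(group)))
        kept.extend(rng.sample(group, n_keep))
    return kept


# Hypothetical oversized study: 200 cancer runs and 100 healthy runs.
study = [(f"c{i}", "cancer") for i in range(200)] + [(f"h{i}", "healthy") for i in range(100)]
subset = stratified_subsample(study, max_runs=120)
n_cancer = sum(1 for _, label in subset if label == "cancer")
n_healthy = sum(1 for _, label in subset if label == "healthy")
print(n_cancer, n_healthy)  # 80 40
```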

Per-study CSV format

Each CSV file contains one row per sequencing run. Key columns:

  • Run — SRA Run accession (SRR, ERR, or DRR prefix)
  • BioSample — SRA BioSample accession
  • sample_label — normalized label: healthy, breast_cancer, colorectal_cancer, or benign
  • sample_used — Boolean; TRUE for runs included in the analysis

Benign samples (adenomas, benign colon polyps, DCIS) and non-fecal samples are excluded from all modeling (sample_used = FALSE). Other study-specific columns (e.g. cohort) vary by file.
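For readers who want to load a per-study file directly, the sketch below filters rows on sample_used using the standard csv module. The column names come from the list above; the inline rows and accessions are made up for illustration, and the assumption that the Boolean is stored as the text TRUE/FALSE follows the notation used in this section.

```python
import csv
import io

# Inline stand-in for a per-study file; accessions and rows are invented.
EXAMPLE_CSV = """Run,BioSample,sample_label,sample_used
SRR0000001,SAMN00000001,healthy,TRUE
SRR0000002,SAMN00000002,colorectal_cancer,TRUE
SRR0000003,SAMN00000003,benign,FALSE
"""


def load_used_runs(handle):
    """Return {Run: sample_label} for rows whose sample_used is TRUE."""
    reader = csv.DictReader(handle)
    return {
        row["Run"]: row["sample_label"]
        for row in reader
        if row["sample_used"].strip().upper() == "TRUE"
    }


used = load_used_runs(io.StringIO(EXAMPLE_CSV))
print(used)
```

With a real file, the same function can be called on a handle opened with open(path, newline=""), where path points to one of the CSV files under data/breast/ or data/colorectal/.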


Data Analysis Pipeline

Install dependencies first, then run pipeline steps in order.

Installation

pip install -r requirements.txt

This installs all Python dependencies, including the local hyenadna and setbert packages in editable mode.

Pipeline steps

make download_data                 # Download 16S sequences from SRA (~13 GB)
make -j4 tetramer_cache            # Build hive-partitioned Parquet cache (~16 min, 1.5 GB)
make tetramer_frequencies          # Sum cached counts per run (~2 min)
make -j4 fit_tetramer EXPT=0       # Train tetramer classifiers (~1 min)
make run_uc_cap FEAT=0             # Build cluster abundance profiles (~29 min, 13 GB RAM)
make -j4 fit_uc_cap FEAT=0 EXPT=0  # Train UC/CAP classifiers (~13 min)
make hyenadna_run_tensors          # Build HyenaDNA input tensors (~12 min, 2.5 GB)
make train_hyenadna EXPT=0         # Fine-tune HyenaDNA (~6 hr)
make setbert_run_tensors           # Build SetBERT input tensors (~43 min, 2.3 GB)
make train_setbert EXPT=0          # Fine-tune SetBERT (~17 hr)

After running the above, regenerate manuscript tables and figures:

for f in helpers/table*.py; do python "$f"; done
for f in helpers/figure*.py; do python "$f"; done

Notes on EXPT and FEAT: EXPT=0 runs all experiments listed in experiments.yaml for a target; FEAT=0 builds all UC/CAP feature sets. Use EXPT=N or FEAT=N to run a single entry. Non-GPU steps support -j for parallel execution. make fit_uc_cap needs both FEAT and EXPT set explicitly, even when sweeping over only one of them.

Debugging Make targets: make explain-<target> prints why Make would rebuild a target and its full prerequisite chain. For example: make explain-run_uc_cap FEAT=0.

Hardware: The full pipeline runs in approximately 25 hours on a machine with 8 CPU cores, a 16 GB NVIDIA GPU, and 32 GB of RAM.


Code Details

HyenaDNA

Source files are under hyenadna/. The package is installed by requirements.txt, or manually from the repo root:

pip install -e hyenadna

Local modifications are summarized in the comments within each file. To verify the installation:

cd hyenadna; python -c 'import inference_example as ex; ex.inference_single()'

Pre-trained checkpoint: hyenadna-small-32k-seqlen, cloned from Hugging Face on first use if not already present under paths.checkpoint_dir.

SetBERT

Source files are under setbert/. The package was repackaged from three upstream sources.

Local modifications:

  • Use SDPA (scaled dot-product attention) for faster multi-head attention, including the relative position bias in RelativeMultiHeadAttention.
  • Set use_reentrant=False on the activation-checkpoint call so gradients reach the DNABERT encoder.
  • Match the destination embedding dtype to the encoder output to avoid bf16/fp32 conversions under AMP.

Pre-trained checkpoint: qiita-16s (12-layer DNABERT encoder, 768-dimensional embeddings, 6-layer SAB transformer). Note: the SetBERT paper describes a 64-dim model that is not published on Hugging Face.


Repository Layout

data/               Per-study CSV files (breast/ and colorectal/)
datasets.csv        Study list with partition assignments (development/holdout)
defaults.yaml       Default pipeline parameters
experiments.yaml    Experiment configurations for Make targets
scripts/            Analysis scripts called by Make targets
helpers/            Scripts for generating manuscript tables and figures
manuscript/         Manuscript source, figures, and tables [*]
hyenadna/           Local HyenaDNA package
setbert/            Local SetBERT package
Makefile            Pipeline entry point
requirements.txt    Python dependencies

[*] Note: manuscript.lyx is the live version of the manuscript that is being hand-edited in LyX. manuscript.md is an abandoned version that is kept here to retain edit history (including human and AI edits). make manuscript_pdf creates a PDF file from the outdated manuscript.md; refer to the LyX file instead for the latest changes.


Citation

  • If you use BreCol data or code, please cite the accompanying manuscript (forthcoming).
  • Raw sequencing data are available at the SRA accessions listed in the study table above; those studies should be cited when using their data.
