Data Foundry: a Schema and Toolkit for Curating Tabular ML Datasets


📂 Examples · 🧑‍🔬 Contribute a Dataset · 📄 Paper (placeholder, coming soon)

Data Foundry is the data layer behind the next generation of TabArena datasets. It provides:

  • A small, opinionated schema for tabular datasets, tasks (IID / temporal non-IID / grouped non-IID), and outer CV splits, aligned with OpenML where possible and extended where necessary.
  • A curation toolkit (sanity checks, recommended-split helpers, dtype-preserving save/load) so a curator turns a raw download into a reproducible artifact in one notebook.
  • A collections API that pins datasets (defined by (unique_name, uuid)) to immutable curated containers and resolves them against a local warehouse or directly against the BeyondArena Datasets.

⚡ Quickstart

Tip

Pull a real curated dataset from BeyondArena and inspect its full metadata + outer CV splits. The first call fetches from Hugging Face; subsequent calls hit your local cache.

pip install data-foundry
python examples/load_curated_container.py

from data_foundry.collections import BEYOND_ARENA

container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
print(container.describe())          # full identity + dtypes + task + splits
print(container.dataset.shape)       # the actual DataFrame
print(container.task_metadata.split_regime)  # "iid", "temporal_non_iid", or "grouped_non_iid"

That covers the core read API: one call to fetch a container, then plain attribute access. See examples/benchmark_on_beyond_arena.py for a Random Forest benchmark on the data.

🕹️ Use Cases

🧪 Inspect a curated container offline (no Hugging Face download required)

The package ships a toy CuratedContainer so you can poke at the full API (schema, dtypes, splits, describe()) without touching the network. Its interface is identical to that of a downloaded BeyondArena container.

from data_foundry.curation_container import CuratedContainer
from data_foundry.examples import get_toy_container_path

container = CuratedContainer.load(get_toy_container_path())
print(container.describe())          # full identity + dtypes + task + splits
print(container.dataset.shape)       # the actual DataFrame
print(container.task_metadata.split_regime)  # "iid", "temporal_non_iid", or "grouped_non_iid"

Full inspection script (every metadata field printed): examples/load_curated_container.py.

📦 Use one dataset: IID and non-IID variants

Download a single BeyondArena container by name (or UUID) and iterate its outer CV splits. The collection resolves the container against your local cache; subsequent runs hit disk, not the network.

from data_foundry.collections import BEYOND_ARENA

container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
df = container.dataset
target = container.task_metadata.target_column_name

for repeat_id, folds in container.experiment_metadata.splits.items():
    for fold_id, (train_idx, test_idx) in folds.items():
        X_train, y_train = df.iloc[train_idx].drop(columns=target), df.iloc[train_idx][target]
        X_test,  y_test  = df.iloc[test_idx].drop(columns=target),  df.iloc[test_idx][target]
        # ... fit, evaluate ...

Full worked example (Random Forest, RMSE per fold, full metadata via container.describe()): examples/benchmark_on_beyond_arena.py.
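
As a rough, dependency-free sketch of what the `# ... fit, evaluate ...` step might feed into, the snippet below walks a nested `{repeat_id: {fold_id: (train_idx, test_idx)}}` mapping (the shape implied by the loop above), scores a predict-the-train-mean baseline by RMSE on each fold, and averages within each repeat. The toy targets, `toy_splits`, and the helper names are made up for illustration; a real run would use `container.dataset` and an actual model.

```python
import math
from statistics import mean

# Hypothetical toy data: eight regression targets, indexed by row position.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Assumed split layout: {repeat_id: {fold_id: (train_idx, test_idx)}}.
toy_splits = {
    0: {0: ([0, 1, 2, 3], [4, 5, 6, 7]), 1: ([4, 5, 6, 7], [0, 1, 2, 3])},
    1: {0: ([0, 2, 4, 6], [1, 3, 5, 7]), 1: ([1, 3, 5, 7], [0, 2, 4, 6])},
}


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def score_mean_baseline(targets, splits):
    """Return {repeat_id: {fold_id: rmse}} for a predict-the-train-mean baseline."""
    scores = {}
    for repeat_id, folds in splits.items():
        scores[repeat_id] = {}
        for fold_id, (train_idx, test_idx) in folds.items():
            prediction = mean(targets[i] for i in train_idx)  # "fit"
            y_test = [targets[i] for i in test_idx]
            scores[repeat_id][fold_id] = rmse(y_test, [prediction] * len(y_test))
    return scores


fold_scores = score_mean_baseline(y, toy_splits)
# Average folds within each repeat first, then average across repeats.
repeat_means = {r: mean(f.values()) for r, f in fold_scores.items()}
overall = mean(repeat_means.values())
print(repeat_means, overall)
```

Averaging within repeats before averaging across them keeps each repeat equally weighted even if fold counts were to differ.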

Split regimes. BeyondArena ships datasets from three regimes; the regime a dataset belongs to shows up directly on task_metadata:

| Regime | Set on PredictiveMLTaskMetadata | Meaning |
|---|---|---|
| IID | neither time_on nor group_on | rows are independent; random / stratified splits |
| temporal non-IID | time_on set | rows ordered in time; future rows must not leak backwards |
| grouped non-IID | group_on set (+ group_labels) | all rows of a group stay together in one fold |
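
The mapping in the table can be sketched as a small function. This is only an illustration: `ToyTaskMetadata` and `infer_split_regime` are hypothetical stand-ins (the real object already exposes `split_regime`), and the behaviour when both fields are set is an assumption, since the table does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTaskMetadata:
    """Hypothetical stand-in carrying only the two fields named in the table."""

    time_on: Optional[str] = None
    group_on: Optional[str] = None


def infer_split_regime(task: ToyTaskMetadata) -> str:
    """Map the time_on / group_on fields to a regime label."""
    if task.time_on is not None and task.group_on is not None:
        # Assumption: the table does not describe this combination, so refuse to guess.
        raise ValueError("both time_on and group_on are set; regime is ambiguous")
    if task.time_on is not None:
        return "temporal_non_iid"
    if task.group_on is not None:
        return "grouped_non_iid"
    return "iid"


print(infer_split_regime(ToyTaskMetadata()))
print(infer_split_regime(ToyTaskMetadata(time_on="timestamp")))
print(infer_split_regime(ToyTaskMetadata(group_on="patient_id")))
```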

Side-by-side regime printout covering one IID dataset, two grouped variants (per_group vs per_sample), and one temporal dataset: examples/data_foundry_data_regimes.py.
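
When consuming non-IID splits it can help to assert the regime's guarantee before training. The two checks below are a minimal sketch, not part of Data Foundry: the function names and toy rows are hypothetical, and "strictly earlier" is one possible reading of the temporal rule.

```python
def groups_stay_together(group_labels, train_idx, test_idx):
    """True if no group label appears on both sides of the split."""
    train_groups = {group_labels[i] for i in train_idx}
    test_groups = {group_labels[i] for i in test_idx}
    return train_groups.isdisjoint(test_groups)


def train_precedes_test(timestamps, train_idx, test_idx):
    """True if every training row is strictly earlier than every test row."""
    latest_train = max(timestamps[i] for i in train_idx)
    earliest_test = min(timestamps[i] for i in test_idx)
    return latest_train < earliest_test


# Hypothetical toy rows: one group label and one timestamp per row position.
group_labels = ["a", "a", "b", "b", "c", "c"]
timestamps = [1, 2, 3, 4, 5, 6]

good_grouped = ([0, 1, 2, 3], [4, 5])   # group "c" is held out whole
bad_grouped = ([0, 1, 2, 4], [3, 5])    # groups "b" and "c" straddle the split
good_temporal = ([0, 1, 2, 3], [4, 5])  # train ends at t=4, test starts at t=5
bad_temporal = ([0, 1, 2, 5], [3, 4])   # a t=6 row sits in train, ahead of the test rows

print(groups_stay_together(group_labels, *good_grouped))
print(groups_stay_together(group_labels, *bad_grouped))
print(train_precedes_test(timestamps, *good_temporal))
print(train_precedes_test(timestamps, *bad_temporal))
```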

🗂️ Use a collection of datasets: pre-download all of BeyondArena

BEYOND_ARENA.prefetch(...) batches every container into a single Hugging Face snapshot_download call (one network round-trip for the whole collection). On a warm cache it skips importing huggingface_hub entirely.

from data_foundry.collections import BEYOND_ARENA

paths = BEYOND_ARENA.prefetch()          # warms the cache once
for container in BEYOND_ARENA.iter_containers():  # now hits disk only
    print(container.dataset_metadata.unique_name, container.dataset.shape)

Cache management:

BEYOND_ARENA.clear_cache()                 # nuke this collection's subdir
BEYOND_ARENA.get_dataset(name, force_download=True)  # re-fetch a single container

Full worked example with tqdm progress + checksum verification: examples/download_all_beyond_arena_datasets.py. For a single dataset round-trip with checksum verification, see examples/download_beyond_arena_dataset.py.
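
For readers unfamiliar with the idea, checksum verification boils down to hashing the bytes on disk and comparing against a recorded digest. The sketch below assumes SHA-256 over a single file; which algorithm Data Foundry uses and exactly what it hashes are not stated here, so treat `sha256_of_file` and `verify_checksum` as hypothetical helpers rather than the library's method.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """True if the file's digest matches the recorded one."""
    return sha256_of_file(path) == expected_hex


# Demo on a throwaway file; a real check would point at a downloaded container.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "artifact.bin"
    artifact.write_bytes(b"data foundry")
    recorded = hashlib.sha256(b"data foundry").hexdigest()
    ok = verify_checksum(artifact, recorded)
    tampered = verify_checksum(artifact, "0" * 64)

print(ok, tampered)
```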

🧑‍🔬 Curate a dataset: turn a raw download into a CuratedContainer

End-to-end pipeline, condensed (the full runnable version is examples/curate_a_dataset.py):

from data_foundry.schema import DatasetMetadata, PredictiveMLTaskMetadata

# --- Basic metadata
dataset_mold = DatasetMetadata(
    unique_name="blood_transfusion",
    dataset_year="2008",
    domain_str="medical & healthcare",
    dataset_source="UCI",
    original_dataset_source_download_link="https://doi.org/10.24432/C5GS39",
    download_description="""
We download the data from the UCI repository and unzip it to a predefined folder.

mkdir -p local-data-warehouse/blood_transfusion/ \\
  && wget -P local-data-warehouse/blood_transfusion/ \\
       https://archive.ics.uci.edu/static/public/176/blood+transfusion+service+center.zip \\
  && unzip local-data-warehouse/blood_transfusion/blood+transfusion+service+center.zip \\
       -d local-data-warehouse/blood_transfusion/
""",
    academic_reference_bibtex="""@article{yeh2009knowledge,
  title={Knowledge discovery on RFM model using Bernoulli sequence},
  author={Yeh, I-Cheng and Yang, King-Jang and Ting, Tao-Ming},
  journal={Expert Systems with applications},
  volume={36}, number={3}, pages={5866--5871},
  year={2009}, publisher={Elsevier},
}
""",
    academic_reference_bibtex_key="yeh2009knowledge",
    license="CC BY 4.0",
    data_tags=["IID"],
    curation_comments="Renamed features for clarity; mapped target 0/1 → No/Yes; ~29% duplicate rows kept.",
)
task_mold = PredictiveMLTaskMetadata(
    target_column_name="DonatedBloodInMarch2007",
    problem_type="binary_classification",
    objective_metric_name="roc_auc",
    stratify_on="DonatedBloodInMarch2007",
)

# --- Preprocessing
import pandas as pd
df = pd.read_csv(f"{dataset_mold.path}/transfusion.data")
df.columns = [
    "MonthsSinceLastDonation", "NumberOfDonations", "TotalBloodDonated",
    "MonthsSinceFirstDonation", "DonatedBloodInMarch2007",
]
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].map({1: "Yes", 0: "No"})
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].astype("category")
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# --- Sanity checks
from data_foundry import dataset_checks
df_head, summary, numeric_stats, cat_stats, target_df = dataset_checks.run_all_checks(
    data=df,
    target_feature=task_mold.target_column_name,
    problem_type=task_mold.problem_type,
)

# --- Outer CV splits
from data_foundry.curation_recommendations import (
    get_recommended_iid_splits,
    get_recommended_splits_dimensions,
)

n_repeats, n_splits, test_size = get_recommended_splits_dimensions(dataset=df)
splits = get_recommended_iid_splits(
    dataset=df,
    n_repeats=n_repeats,
    n_splits=n_splits,
    test_size=test_size,
    stratify_on=task_mold.stratify_on,
)

# --- Split metadata + container
from data_foundry.schema import PredictiveMLSplitsMetadata
from data_foundry.curation_container import CuratedContainer

splits_mold = PredictiveMLSplitsMetadata(
    splits_comment="Default splits for IID data.",
    splits=splits,
)
curated_data = CuratedContainer(
    dataset=df,
    dataset_metadata=dataset_mold,
    task_metadata=task_mold,
    experiment_metadata=splits_mold,
)
curated_data.save()
print(curated_data.uuid, curated_data.checksum)
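
To give a feel for what repeated, stratified outer splits look like, here is a standard-library sketch of repeated stratified K-fold. It is not the implementation behind `get_recommended_iid_splits`: the function name, the round-robin dealing, and the nested `{repeat_id: {fold_id: (train_idx, test_idx)}}` return shape are assumptions, and `test_size` is left out because its role in the library is not described here.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_repeats, n_splits, seed=0):
    """Return {repeat_id: {fold_id: (train_idx, test_idx)}} with class-balanced folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {}
    for repeat_id in range(n_repeats):
        fold_members = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold gets a near-equal share.
            for position, idx in enumerate(shuffled):
                fold_members[position % n_splits].append(idx)
        splits[repeat_id] = {}
        for fold_id, test_idx in enumerate(fold_members):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            splits[repeat_id][fold_id] = (train_idx, sorted(test_idx))
    return splits


# Hypothetical toy target with a 1:2 class ratio.
labels = ["Yes"] * 4 + ["No"] * 8
splits = repeated_stratified_kfold(labels, n_repeats=2, n_splits=4)
for repeat_id, folds in splits.items():
    for fold_id, (train_idx, test_idx) in folds.items():
        print(repeat_id, fold_id, [labels[i] for i in test_idx])
```

Each test fold in the toy run preserves the 1:2 class ratio, which is the property stratification is meant to guarantee.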

For the contributor flow (where to put the notebook, how to open the PR, the /new-dataset Claude Code skill, best practices around versioning, anomaly tracking, and dtype handling), see CONTRIBUTING_DATASETS.md.

🪄 Installation

Important

Requires Python 3.10+.

📦 From PyPI: use Data Foundry as a library

pip install data-foundry

🌱 From source: clone and install editable

git clone https://github.com/TabArena/data-foundry.git
cd data-foundry
uv pip install -e .

🛠️ Developer setup: extras for curation, tests, and tooling

git clone https://github.com/TabArena/data-foundry.git
cd data-foundry
uv pip install -e ".[dev,tests]"
pytest                                 # run the test suite
ruff check . && ruff format --check .  # lint + format

The dev extra adds curation-time deps (openml, kaggle, seaborn, polars, etc.); tests adds pytest and scikit-learn (needed for the recommended-split helpers and examples).

🗂️ Repository Structure

data-foundry/
├── src/data_foundry/         # the package: schema, container, collections, checks, splits
│   ├── schema.py             # DatasetMetadata, PredictiveMLTaskMetadata, PredictiveMLSplitsMetadata
│   ├── curation_container.py # CuratedContainer (save/load + describe + checksum)
│   ├── collections/          # BEYOND_ARENA, DatasetCollection, HuggingFaceSource, cache helpers
│   ├── curation_recommendations.py  # recommended split helpers (IID, grouped, temporal)
│   ├── dataset_checks.py     # run_all_checks(...): sanity stats for the curation notebook
│   └── examples/toy_container/  # tiny ready-to-load CuratedContainer shipped in-package
├── datasets/                 # curation notebooks
│   ├── _template/            # canonical notebook skeleton
│   ├── _dev/                 # contributions land here first
│   ├── _maintenance/         # re-runs / fixes for already-released datasets
│   └── beyond_iid/           # promoted datasets, pinned by `final_uuid_list.py`
├── examples/                 # runnable demos (covers the use-cases above)
├── scripts/                  # one-off tooling (toy container builder)
│   └── beyond_arena/         # BeyondArena-specific scripts and outputs (warehouse stats, plots)
├── tests/                    # pytest test suite
└── local-data-warehouse/     # gitignored; curators write raw + saved containers here

🧑‍🔬 Contributing a Dataset

The short version:

  1. Copy datasets/_template/_template.ipynb to datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb.
  2. Run the notebook end-to-end so the saved cells contain populated check tables and the final uuid / checksum.
  3. Open a PR. Reviewers will move the notebook into the right beyond_iid/ subfolder and append the UUID to datasets/beyond_iid/final_uuid_list.py.

The long version (field-by-field walkthrough, split-helper choice, dtype gotchas, the /new-dataset Claude Code scaffolding skill): see CONTRIBUTING_DATASETS.md.

📄 Citation

PLACEHOLDER

