Independent, automated benchmarking of frontier AI coding models. Tracks performance over time so the community knows when models improve — or regress.
Live site: modelregression.com
AI model providers update their models constantly, and sometimes performance degrades without any announcement. ModelRegression runs the same 30 tests against every model daily, scores the results, and surfaces regressions automatically. No vendor self-reporting — just independent, reproducible benchmarks.
30 tests across 10 categories, each targeting a different dimension of coding ability:
| Category | What It Measures |
|---|---|
| Long Reasoning | Multi-step logic, legal reasoning, mathematical proofs |
| Coding Tasks | Algorithm implementation, API design, concurrent pipelines |
| Bug Fixes | Race conditions, memory leaks, off-by-one errors |
| Feature Implementation | OAuth2 flows, search autocomplete, webhook systems |
| Code Thoroughness | Edge case coverage, error handling, test completeness |
| Bug Introduction Rate | Refactoring safety, merge conflicts, dependency upgrades |
| Security Awareness | SQL injection, XSS, secret management |
| Instruction Following | Schema compliance, constraint adherence, multi-step chains |
| Code Quality | Idiomatic Python, TypeScript best practices, clean architecture |
| Performance Efficiency | Algorithm complexity, streaming, query optimization |
- Claude Opus 4.8 (Anthropic), via the `claude` CLI
- Claude Sonnet 4.6 (Anthropic), via the `claude` CLI
- GPT-5.5 (OpenAI), via the `codex` CLI
- Grok (xAI), via the `agent` CLI
Models are tested through their official CLI tools, not direct API calls. This tests the full stack that developers actually use.
┌──────────────┐
│  DGX Sparks  │  Runs benchmarks daily (3am ET)
│  (cron job)  │  Python 3.13 + SQLite
└──────┬───────┘
       │
  benchmark suite runs 30 tests
  against each model via CLI tools
       │
       ▼
┌──────────────┐
│ export_json  │  SQLite → static JSON files
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Next.js 15  │  Static site generation
│  + deploy    │  Blue-green deploy to Linode
└──────┬───────┘
       │
       ▼
┌──────────────┐
│    Linode    │  Nginx reverse proxy + PM2
│    (prod)    │  modelregression.com
└──────────────┘
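The export step in the diagram can be pictured as a small script that reads rows from SQLite and writes static JSON. This is a minimal sketch, not the project's `export_json.py`: the `results` table, its columns, and the `scores.json` filename are assumptions made for illustration.

```python
import json
import sqlite3
from pathlib import Path


def export_scores(conn: sqlite3.Connection, out_dir: Path) -> Path:
    """Dump every row of a hypothetical `results` table to one JSON file."""
    conn.row_factory = sqlite3.Row  # rows can then be converted with dict()
    rows = conn.execute(
        "SELECT model, category, score FROM results ORDER BY model, category"
    ).fetchall()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "scores.json"  # hypothetical filename
    out_path.write_text(json.dumps([dict(r) for r in rows], indent=2))
    return out_path
```

Because the output is plain JSON on disk, the site build needs no database access at build time.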
- Next.js 15 (App Router) with static site generation
- TailwindCSS for styling
- Recharts for interactive charts
- Framer Motion for animations
- Python 3.13 orchestrator with parallel test execution
- SQLite for storing all results, scores, regressions, and outages
- LLM-as-judge evaluation using Claude Sonnet for subjective tests
- Sandbox execution for tests with deterministic outputs
- Regression detection with configurable thresholds and severity levels
- Outage monitoring with pre-flight health checks before each run
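For tests with deterministic outputs, sandbox scoring can be as simple as running the model's code in a separate process and comparing what it prints. The sketch below assumes an all-or-nothing score and a plain subprocess with a timeout. The function name and the 10-second default are hypothetical, and a real sandbox would need stronger isolation than this.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_in_subprocess(code: str, expected_stdout: str, timeout_s: float = 10.0) -> int:
    """Run `code` in a fresh interpreter.

    Returns 100 if stdout matches `expected_stdout` (ignoring surrounding
    whitespace), otherwise 0.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,  # relative paths the candidate writes land in the temp dir
            )
        except subprocess.TimeoutExpired:
            return 0  # a hang counts as a failure
    if proc.returncode != 0:
        return 0  # a crash counts as a failure
    return 100 if proc.stdout.strip() == expected_stdout.strip() else 0
```

For example, `score_in_subprocess("print(2 + 2)", "4")` scores a correct answer, while a script that raises or exceeds the timeout scores 0.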
modelregression/
├── app/                            # Next.js pages
│   ├── page.tsx                    # Dashboard
│   ├── models/[slug]/              # Per-model detail pages
│   ├── categories/[slug]/          # Per-category detail pages
│   ├── compare/                    # Side-by-side model comparison
│   ├── evidence/[runId]/[testId]   # Full test evidence (prompts, outputs, scores)
│   ├── outages/                    # Outage history + uptime
│   ├── methodology/                # How benchmarks work
│   └── about/                      # About the project
├── components/                     # React components
│   ├── charts/                     # Recharts wrappers
│   ├── dashboard/                  # Dashboard-specific components
│   └── shared/                     # Navbar, footer, animations
├── lib/                            # Types, utilities, data loading
├── public/data/                    # Generated JSON (from benchmark engine)
├── benchmark/                      # Python benchmark suite
│   ├── runner.py                   # Main orchestrator
│   ├── config.py                   # Models, categories, test definitions
│   ├── db.py                       # SQLite schema + queries
│   ├── export_json.py              # SQLite → JSON exporter
│   ├── scoring.py                  # Score aggregation + composites
│   ├── regression_detector.py      # Regression detection logic
│   ├── outage_monitor.py           # Health checks + outage tracking
│   ├── run_benchmarks.sh           # Cron entry point (full pipeline)
│   └── tests/                      # Test implementations (30 tests)
│       ├── base.py                 # Base test class
│       ├── long_reasoning.py       # 3 long reasoning tests
│       ├── coding_tasks.py         # 3 coding task tests
│       ├── bug_fixes.py            # 3 bug fix tests
│       └── ...                     # (10 categories x 3 tests each)
├── config/                         # Nginx configuration
├── deploy.sh                       # Blue-green atomic deployment
└── ecosystem.config.js             # PM2 process configuration
# Install dependencies
npm install
# Start dev server (port 3002)
npm run dev
# Production build
npm run build
npm start
cd benchmark
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Run benchmarks for all models
python runner.py --schedule daily
# Run benchmarks for a single model
python runner.py --schedule daily --model grok
# Export results to JSON for the website
python export_json.py --output ../public/data
The benchmark engine calls models through their official CLI tools. You need these installed and authenticated:
- Claude Code (`claude`) for Anthropic models
- Codex CLI (`codex`) for OpenAI models
- Grok Agent (`agent`) for xAI models
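Before a first run, it can help to confirm that the three executables are on `PATH`. The helper below is a hypothetical convenience, not part of the benchmark suite, and it only checks that each tool is installed. It cannot tell whether a tool is authenticated.

```python
import shutil

# Executable names taken from the CLI list above.
REQUIRED_CLIS = ("claude", "codex", "agent")


def find_missing_clis(names=REQUIRED_CLIS) -> list[str]:
    """Return the CLI names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

Calling `find_missing_clis()` returns an empty list when all three tools are installed. Any names it returns need to be installed before `runner.py` can reach the corresponding models.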
Deployment uses a blue-green strategy with zero-downtime swaps:
# Requires .deploy.env with server credentials (not checked into git)
bash deploy.sh
The full pipeline (benchmark, export, build, deploy) runs via cron on the DGX:
0 3 * * * /path/to/benchmark/run_benchmarks.sh
- Each test produces a raw score (0-100) via sandbox execution, LLM-as-judge, or exact match
- Scores are averaged per category (3 tests each)
- Category averages are combined into a composite score (equal weight)
- Failed calls, timeouts, exceptions, and other null test results count as 0 rather than being dropped
- Composite scores are ranked across models
- Regression detection compares against a rolling window of previous runs
- Regressions are classified by severity: minor (>5% drop), moderate (>10%), major (>20%)
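The scoring and regression steps above can be sketched in a few functions. This is an illustration under stated assumptions, not the code in `scoring.py` or `regression_detector.py`: the function names are hypothetical, the drop is assumed to be measured relative to the mean of the rolling window, and the window size of 7 runs is a placeholder because the source does not state one.

```python
from statistics import mean

# Thresholds from the severity list above, checked from most to least severe.
SEVERITY_THRESHOLDS = (("major", 20.0), ("moderate", 10.0), ("minor", 5.0))


def category_score(test_scores):
    """Average one category's test scores; None (failed or timed out) counts as 0."""
    return mean(0.0 if s is None else s for s in test_scores)


def composite_score(scores_by_category):
    """Equal-weight mean of the category averages."""
    return mean(category_score(s) for s in scores_by_category.values())


def rank_models(composites):
    """Model names ordered from highest to lowest composite score."""
    return sorted(composites, key=composites.get, reverse=True)


def classify_regression(current, previous_runs, window=7):
    """Return 'minor', 'moderate', 'major', or None when there is no regression.

    Assumption: the drop is a percentage of the mean of the last `window` runs.
    """
    recent = previous_runs[-window:]
    if not recent:
        return None  # nothing to compare against yet
    baseline = mean(recent)
    if baseline <= 0:
        return None  # a zero baseline cannot drop further
    drop_pct = (baseline - current) / baseline * 100
    for label, threshold in SEVERITY_THRESHOLDS:
        if drop_pct > threshold:
            return label
    return None
```

Counting failures as 0 matters here: `category_score([90, None, 60])` averages over all three tests, not just the two that completed, so an outage lowers the score and cannot be hidden by dropping the missing result.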
Found a bug? Have an idea for a new test category? Open an issue or submit a PR.
When adding new tests:
- Create a new test class in `benchmark/tests/` extending `BaseTest`
- Add the test definition to `benchmark/config.py`
- The runner and exporter pick it up automatically
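As a rough picture of the first step, a new test might look like the class below. The real `BaseTest` interface in `benchmark/tests/base.py` is not shown in this README, so the stand-in base class, its `prompt` and `score` methods, and the `PalindromeTest` example are all hypothetical. Check `base.py` for the actual method names before copying this.

```python
from abc import ABC, abstractmethod


class BaseTest(ABC):
    """Stand-in for the project's base class (hypothetical interface)."""

    name: str = ""
    category: str = ""

    @abstractmethod
    def prompt(self) -> str:
        """Text sent to the model."""

    @abstractmethod
    def score(self, output: str) -> int:
        """Map the model's output to a 0-100 score."""


class PalindromeTest(BaseTest):
    """Hypothetical exact-match test."""

    name = "palindrome_check"
    category = "coding_tasks"

    def prompt(self) -> str:
        return "Is 'racecar' a palindrome? Answer with exactly 'yes' or 'no'."

    def score(self, output: str) -> int:
        # Exact match after trimming whitespace and ignoring case.
        return 100 if output.strip().lower() == "yes" else 0
```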
- Randy Blasik for inspiring this project and suggesting independent, automated model regression tracking
- Ed Skoudis for the idea of testing whether a model has regressed before using it each day
MIT