ModelRegression.com

Independent, automated benchmarking of frontier AI coding models. Tracks performance over time so the community knows when models improve — or regress.

Live site: modelregression.com

Why This Exists

AI model providers update their models constantly, and sometimes performance degrades without any announcement. ModelRegression runs the same 30 tests against every model daily, scores the results, and surfaces regressions automatically. No vendor self-reporting — just independent, reproducible benchmarks.

What Gets Tested

30 tests across 10 categories, each targeting a different dimension of coding ability:

Category	What It Measures
Long Reasoning	Multi-step logic, legal reasoning, mathematical proofs
Coding Tasks	Algorithm implementation, API design, concurrent pipelines
Bug Fixes	Race conditions, memory leaks, off-by-one errors
Feature Implementation	OAuth2 flows, search autocomplete, webhook systems
Code Thoroughness	Edge case coverage, error handling, test completeness
Bug Introduction Rate	Refactoring safety, merge conflicts, dependency upgrades
Security Awareness	SQL injection, XSS, secret management
Instruction Following	Schema compliance, constraint adherence, multi-step chains
Code Quality	Idiomatic Python, TypeScript best practices, clean architecture
Performance Efficiency	Algorithm complexity, streaming, query optimization

Models Tracked

Claude Opus 4.8 (Anthropic) — via claude CLI
Claude Sonnet 4.6 (Anthropic) — via claude CLI
GPT-5.5 (OpenAI) — via codex CLI
Grok (xAI) — via agent CLI

Models are tested through their official CLI tools, not direct API calls. This tests the full stack that developers actually use.
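
A minimal sketch of what calling a model through a CLI could look like, assuming the tool accepts the prompt as its final command-line argument and prints the answer to stdout. The name `run_cli`, the timeout value, and the convention of returning `None` on failure are illustrative and not taken from the project. The real per-tool flags are not shown in this README, so a Python one-liner stands in for the model CLI.

```python
from __future__ import annotations

import subprocess
import sys


def run_cli(command: list[str], prompt: str, timeout_s: float = 300.0) -> str | None:
    """Run a CLI with the prompt as the final argument.

    Returns stdout on success, or None on timeout, non-zero exit, or a
    missing executable (a hypothetical convention for this sketch).
    """
    try:
        proc = subprocess.run(
            [*command, prompt],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout


# Stand-in for a real model CLI: echoes the prompt back in upper case.
FAKE_MODEL = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]

if __name__ == "__main__":
    print(run_cli(FAKE_MODEL, "reverse a linked list"))
```

Returning `None` instead of raising keeps a single failed call from aborting the whole run, which fits the scoring rule that null results count as 0.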

Architecture

                ┌──────────────┐
                │   DGX Sparks │  Runs benchmarks daily (3am ET)
                │   (cron job) │  Python 3.13 + SQLite
                └──────┬───────┘
                       │
            benchmark suite runs 30 tests
            against each model via CLI tools
                       │
                       ▼
                ┌──────────────┐
                │  export_json │  SQLite → static JSON files
                └──────┬───────┘
                       │
                       ▼
                ┌──────────────┐
                │   Next.js 15 │  Static site generation
                │   + deploy   │  Blue-green deploy to Linode
                └──────┬───────┘
                       │
                       ▼
                ┌──────────────┐
                │    Linode    │  Nginx reverse proxy + PM2
                │    (prod)    │  modelregression.com
                └──────────────┘
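
The export step in the diagram (SQLite to static JSON) can be sketched roughly as follows. The table name `results`, its columns, and the function `export_results` are assumptions made for illustration; the real schema lives in benchmark/db.py and is not shown in this README.

```python
from __future__ import annotations

import json
import sqlite3
from pathlib import Path


def export_results(conn: sqlite3.Connection, out_dir: Path) -> list[Path]:
    """Write one JSON file per model containing all of its result rows."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = conn.execute(
        "SELECT model, category, test_id, score, run_date "
        "FROM results ORDER BY model, run_date, test_id"
    ).fetchall()

    # Group rows by model; the model name becomes the file name.
    by_model: dict[str, list[dict]] = {}
    for row in rows:
        record = dict(row)
        by_model.setdefault(record.pop("model"), []).append(record)

    written = []
    for model, records in by_model.items():
        path = out_dir / f"{model}.json"
        path.write_text(json.dumps(records, indent=2))
        written.append(path)
    return written


if __name__ == "__main__":
    import tempfile

    # Sample data only; the schema is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results "
        "(model TEXT, category TEXT, test_id TEXT, score REAL, run_date TEXT)"
    )
    conn.execute(
        "INSERT INTO results VALUES ('grok', 'bug_fixes', 'sample_test', 80.0, '2025-01-01')"
    )
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_results(conn, Path(tmp)):
            print(path.name, json.loads(path.read_text()))
```

A NULL score in SQLite is written as JSON `null`, so failed runs stay visible to the site instead of disappearing from the export.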

Website Stack

Next.js 15 (App Router) with static site generation
TailwindCSS for styling
Recharts for interactive charts
Framer Motion for animations

Benchmark Engine

Python 3.13 orchestrator with parallel test execution
SQLite for storing all results, scores, regressions, and outages
LLM-as-judge evaluation using Claude Sonnet for subjective tests
Sandbox execution for tests with deterministic outputs
Regression detection with configurable thresholds and severity levels
Outage monitoring with pre-flight health checks before each run
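
As a rough illustration of how a pre-flight health check and parallel execution might fit together, here is a sketch using a thread pool. The function `run_suite`, its parameters, and the callables passed in are hypothetical; the actual orchestrator in benchmark/runner.py may be organized quite differently.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_suite(
    models: list[str],
    tests: list[str],
    is_healthy: Callable[[str], bool],
    run_test: Callable[[str, str], float],
    max_workers: int = 4,
) -> tuple[dict[tuple[str, str], float | None], list[str]]:
    """Health-check each model, then run all tests for healthy models in parallel.

    Returns (results, outages): results maps (model, test) to a score or
    None, and outages lists the models that failed the pre-flight check.
    """
    healthy = [m for m in models if is_healthy(m)]
    outages = [m for m in models if m not in healthy]
    jobs = [(m, t) for m in healthy for t in tests]

    def safe_run(job: tuple[str, str]) -> float | None:
        try:
            return run_test(*job)
        except Exception:
            return None  # kept as a null result rather than dropped

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(safe_run, jobs))  # map preserves job order
    return dict(zip(jobs, scores)), outages
```

Threads are a reasonable fit here under the assumption that each test mostly waits on an external CLI process rather than doing CPU-bound work in Python.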

Project Structure

modelregression/
├── app/                          # Next.js pages
│   ├── page.tsx                  #   Dashboard
│   ├── models/[slug]/            #   Per-model detail pages
│   ├── categories/[slug]/        #   Per-category detail pages
│   ├── compare/                  #   Side-by-side model comparison
│   ├── evidence/[runId]/[testId] #   Full test evidence (prompts, outputs, scores)
│   ├── outages/                  #   Outage history + uptime
│   ├── methodology/              #   How benchmarks work
│   └── about/                    #   About the project
├── components/                   # React components
│   ├── charts/                   #   Recharts wrappers
│   ├── dashboard/                #   Dashboard-specific components
│   └── shared/                   #   Navbar, footer, animations
├── lib/                          # Types, utilities, data loading
├── public/data/                  # Generated JSON (from benchmark engine)
├── benchmark/                    # Python benchmark suite
│   ├── runner.py                 #   Main orchestrator
│   ├── config.py                 #   Models, categories, test definitions
│   ├── db.py                     #   SQLite schema + queries
│   ├── export_json.py            #   SQLite → JSON exporter
│   ├── scoring.py                #   Score aggregation + composites
│   ├── regression_detector.py    #   Regression detection logic
│   ├── outage_monitor.py         #   Health checks + outage tracking
│   ├── run_benchmarks.sh         #   Cron entry point (full pipeline)
│   └── tests/                    #   Test implementations (30 tests)
│       ├── base.py               #     Base test class
│       ├── long_reasoning.py     #     3 long reasoning tests
│       ├── coding_tasks.py       #     3 coding task tests
│       ├── bug_fixes.py          #     3 bug fix tests
│       └── ...                   #     (10 categories x 3 tests each)
├── config/                       # Nginx configuration
├── deploy.sh                     # Blue-green atomic deployment
└── ecosystem.config.js           # PM2 process configuration

Setup

Website (Development)

# Install dependencies
npm install

# Start dev server (port 3002)
npm run dev

# Production build
npm run build
npm start

Benchmark Engine

cd benchmark

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run benchmarks for all models
python runner.py --schedule daily

# Run benchmarks for a single model
python runner.py --schedule daily --model grok

# Export results to JSON for the website
python export_json.py --output ../public/data

CLI Tool Prerequisites

The benchmark engine calls models through their official CLI tools. You need these installed and authenticated:

Claude Code (claude) — for Anthropic models
Codex CLI (codex) — for OpenAI models
Grok Agent (agent) — for xAI models

Deployment

Deployment uses a blue-green strategy with zero-downtime swaps:

# Requires .deploy.env with server credentials (not checked into git)
bash deploy.sh

Automated Daily Runs

The full pipeline (benchmark, export, build, deploy) runs via cron on the DGX every day at 03:00 in the host's time zone:

0 3 * * * /path/to/benchmark/run_benchmarks.sh

How Scoring Works

Each test produces a raw score (0-100) via sandbox execution, LLM-as-judge, or exact match
Scores are averaged per category (3 tests each)
Category averages are combined into a composite score (equal weight)
Failed calls, timeouts, exceptions, and other null test results count as 0 rather than being dropped
Composite scores are ranked across models
Regression detection compares against a rolling window of previous runs
Regressions are classified by severity: minor (>5% drop), moderate (>10%), major (>20%)
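
The scoring steps above can be sketched in a few functions. This is an illustration under stated assumptions, not the code in benchmark/scoring.py or benchmark/regression_detector.py: the function names are made up, the rolling-window baseline is taken to be a simple mean, and the severity thresholds are read as relative drops against that baseline (the README does not say whether they are relative or in absolute points).

```python
from __future__ import annotations

from statistics import mean


def category_average(scores: list[float | None]) -> float:
    """Average one category's test scores; None (failure, timeout) counts as 0."""
    return mean([0.0 if s is None else s for s in scores])


def composite_score(categories: dict[str, list[float | None]]) -> float:
    """Equal-weight mean of the per-category averages."""
    return mean([category_average(s) for s in categories.values()])


def classify_regression(current: float, previous: list[float]) -> str | None:
    """Compare the current composite against a window of earlier composites.

    ASSUMPTION: baseline = mean of the window, thresholds = relative drop in %.
    """
    if not previous:
        return None  # nothing to compare against yet
    baseline = mean(previous)
    if baseline <= 0:
        return None
    drop_pct = (baseline - current) / baseline * 100
    if drop_pct > 20:
        return "major"
    if drop_pct > 10:
        return "moderate"
    if drop_pct > 5:
        return "minor"
    return None


if __name__ == "__main__":
    today = composite_score({
        "bug_fixes": [90, None, 60],       # one failed call, averages to 50
        "code_quality": [100, 100, 100],   # averages to 100
    })
    print(today)                                    # 75.0
    print(classify_regression(today, [85, 85, 85])) # moderate (about 11.8% drop)
```

Counting failures as 0 means an outage or timeout pulls the composite down, so a model cannot look better simply by failing on its hardest tests.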

Contributing

Found a bug? Have an idea for a new test category? Open an issue or submit a PR.

When adding new tests:

Create a new test class in benchmark/tests/ extending BaseTest
Add the test definition to benchmark/config.py
The runner and exporter pick it up automatically
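
For orientation, a new test might look something like the sketch below. The `BaseTest` shown here is a stand-in defined only so the example runs on its own; the real base class in benchmark/tests/base.py is not documented in this README and its attributes and method names may differ. The test itself (`FizzBuzzExactMatch`) is likewise hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseTest(ABC):
    """Stand-in for benchmark/tests/base.py; the real interface may differ."""

    test_id: str
    category: str

    @abstractmethod
    def prompt(self) -> str:
        """Text sent to the model."""

    @abstractmethod
    def score(self, output: str) -> float:
        """Return a raw score from 0 to 100 for the model's output."""


class FizzBuzzExactMatch(BaseTest):
    """Hypothetical exact-match test: output must match line for line."""

    test_id = "fizzbuzz_exact_match"
    category = "instruction_following"

    def prompt(self) -> str:
        return "Print FizzBuzz for 1 through 5, one value per line, and nothing else."

    def score(self, output: str) -> float:
        expected = ["1", "2", "Fizz", "4", "Buzz"]
        lines = [line.strip() for line in output.strip().splitlines()]
        return 100.0 if lines == expected else 0.0
```

Check benchmark/tests/base.py and an existing test file such as benchmark/tests/bug_fixes.py for the actual method names before copying this shape.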

Acknowledgments

Randy Blasik for inspiring this project and suggesting independent, automated model regression tracking
Ed Skoudis for the idea of testing whether a model has regressed before using it each day

License

MIT
