HackingDave/modelregression

ModelRegression.com

Independent, automated benchmarking of frontier AI coding models. Tracks performance over time so the community knows when models improve — or regress.

Live site: modelregression.com

Why This Exists

AI model providers update their models constantly, and sometimes performance degrades without any announcement. ModelRegression runs the same 30 tests against every model daily, scores the results, and surfaces regressions automatically. No vendor self-reporting — just independent, reproducible benchmarks.

What Gets Tested

30 tests across 10 categories, each targeting a different dimension of coding ability:

Category                 What It Measures
Long Reasoning           Multi-step logic, legal reasoning, mathematical proofs
Coding Tasks             Algorithm implementation, API design, concurrent pipelines
Bug Fixes                Race conditions, memory leaks, off-by-one errors
Feature Implementation   OAuth2 flows, search autocomplete, webhook systems
Code Thoroughness        Edge case coverage, error handling, test completeness
Bug Introduction Rate    Refactoring safety, merge conflicts, dependency upgrades
Security Awareness       SQL injection, XSS, secret management
Instruction Following    Schema compliance, constraint adherence, multi-step chains
Code Quality             Idiomatic Python, TypeScript best practices, clean architecture
Performance Efficiency   Algorithm complexity, streaming, query optimization

Models Tracked

  • Claude Opus 4.8 (Anthropic) — via claude CLI
  • Claude Sonnet 4.6 (Anthropic) — via claude CLI
  • GPT-5.5 (OpenAI) — via codex CLI
  • Grok (xAI) — via agent CLI

Models are tested through their official CLI tools, not direct API calls. This tests the full stack that developers actually use.
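
As an illustration of CLI-based testing, here is a minimal sketch of a wrapper that sends a prompt to a command on stdin and captures its output, with a timeout. The names `run_cli`, `CliResult`, and `ECHO_CMD` are hypothetical, and the real flags for the claude, codex, and agent CLIs are not shown here, so a stand-in command is used instead.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CliResult:
    ok: bool
    output: Optional[str]
    error: Optional[str]


def run_cli(argv: Sequence[str], prompt: str, timeout_s: float = 300.0) -> CliResult:
    """Send `prompt` on stdin to a CLI command and capture its stdout.

    Assumption: the CLI reads the prompt from stdin. The actual tools may
    expect it as an argument or flag instead.
    """
    try:
        proc = subprocess.run(
            list(argv),
            input=prompt,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return CliResult(False, None, "timeout")
    except FileNotFoundError:
        return CliResult(False, None, "cli not installed")
    if proc.returncode != 0:
        return CliResult(False, None, proc.stderr.strip() or f"exit {proc.returncode}")
    return CliResult(True, proc.stdout, None)


# Stand-in command so the sketch runs anywhere: echoes stdin in upper case.
ECHO_CMD = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]

if __name__ == "__main__":
    print(run_cli(ECHO_CMD, "hello"))
```

A wrapper like this could also serve as a pre-flight health check by sending a trivial prompt with a short timeout and treating any non-ok result as a possible outage.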

Architecture

                ┌──────────────┐
                │   DGX Sparks │  Runs benchmarks daily (3am ET)
                │   (cron job) │  Python 3.13 + SQLite
                └──────┬───────┘
                       │
            benchmark suite runs 30 tests
            against each model via CLI tools
                       │
                       ▼
                ┌──────────────┐
                │  export_json │  SQLite → static JSON files
                └──────┬───────┘
                       │
                       ▼
                ┌──────────────┐
                │   Next.js 15 │  Static site generation
                │   + deploy   │  Blue-green deploy to Linode
                └──────┬───────┘
                       │
                       ▼
                ┌──────────────┐
                │    Linode    │  Nginx reverse proxy + PM2
                │    (prod)    │  modelregression.com
                └──────────────┘

Website Stack

  • Next.js 15 (App Router) with static site generation
  • TailwindCSS for styling
  • Recharts for interactive charts
  • Framer Motion for animations

Benchmark Engine

  • Python 3.13 orchestrator with parallel test execution
  • SQLite for storing all results, scores, regressions, and outages
  • LLM-as-judge evaluation using Claude Sonnet for subjective tests
  • Sandbox execution for tests with deterministic outputs
  • Regression detection with configurable thresholds and severity levels
  • Outage monitoring with pre-flight health checks before each run
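
The parallel execution and SQLite storage described above might look roughly like the sketch below. The `results` table layout and the function names `run_all` and `store` are assumptions for illustration; the actual schema lives in benchmark/db.py and may differ.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def run_all(tests: Dict[str, Callable[[], float]],
            max_workers: int = 4) -> Dict[str, Optional[float]]:
    """Run test callables in parallel; a test that raises yields None."""

    def safe(fn: Callable[[], float]) -> Optional[float]:
        try:
            return float(fn())
        except Exception:
            return None

    # Threads are adequate here on the assumption that each test mostly
    # waits on an external CLI process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {tid: pool.submit(safe, fn) for tid, fn in tests.items()}
        return {tid: fut.result() for tid, fut in futures.items()}


def store(conn: sqlite3.Connection, run_id: str, model: str,
          scores: Dict[str, Optional[float]]) -> None:
    """Persist one run's scores; None is stored as NULL (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "run_id TEXT, model TEXT, test_id TEXT, score REAL, "
        "PRIMARY KEY (run_id, model, test_id))"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            [(run_id, model, tid, s) for tid, s in scores.items()],
        )
```

Keeping the database writes on the main thread, after the pool has finished, avoids sharing one SQLite connection across worker threads.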

Project Structure

modelregression/
├── app/                          # Next.js pages
│   ├── page.tsx                  #   Dashboard
│   ├── models/[slug]/            #   Per-model detail pages
│   ├── categories/[slug]/        #   Per-category detail pages
│   ├── compare/                  #   Side-by-side model comparison
│   ├── evidence/[runId]/[testId] #   Full test evidence (prompts, outputs, scores)
│   ├── outages/                  #   Outage history + uptime
│   ├── methodology/              #   How benchmarks work
│   └── about/                    #   About the project
├── components/                   # React components
│   ├── charts/                   #   Recharts wrappers
│   ├── dashboard/                #   Dashboard-specific components
│   └── shared/                   #   Navbar, footer, animations
├── lib/                          # Types, utilities, data loading
├── public/data/                  # Generated JSON (from benchmark engine)
├── benchmark/                    # Python benchmark suite
│   ├── runner.py                 #   Main orchestrator
│   ├── config.py                 #   Models, categories, test definitions
│   ├── db.py                     #   SQLite schema + queries
│   ├── export_json.py            #   SQLite → JSON exporter
│   ├── scoring.py                #   Score aggregation + composites
│   ├── regression_detector.py    #   Regression detection logic
│   ├── outage_monitor.py         #   Health checks + outage tracking
│   ├── run_benchmarks.sh         #   Cron entry point (full pipeline)
│   └── tests/                    #   Test implementations (30 tests)
│       ├── base.py               #     Base test class
│       ├── long_reasoning.py     #     3 long reasoning tests
│       ├── coding_tasks.py       #     3 coding task tests
│       ├── bug_fixes.py          #     3 bug fix tests
│       └── ...                   #     (10 categories x 3 tests each)
├── config/                       # Nginx configuration
├── deploy.sh                     # Blue-green atomic deployment
└── ecosystem.config.js           # PM2 process configuration

Setup

Website (Development)

# Install dependencies
npm install

# Start dev server (port 3002)
npm run dev

# Production build
npm run build
npm start

Benchmark Engine

cd benchmark

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run benchmarks for all models
python runner.py --schedule daily

# Run benchmarks for a single model
python runner.py --schedule daily --model grok

# Export results to JSON for the website
python export_json.py --output ../public/data
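
For readers curious about what the export step does conceptually, here is a minimal sketch of dumping a SQLite table to a static JSON file. The `results` table, its columns, and the `results.json` file name are assumptions; the real exporter in benchmark/export_json.py may produce different files.

```python
import json
import sqlite3
from pathlib import Path


def export_results(conn: sqlite3.Connection, out_dir: Path) -> Path:
    """Write every row of a hypothetical `results` table to results.json."""
    conn.row_factory = sqlite3.Row  # rows become mapping-like objects
    rows = conn.execute(
        "SELECT run_id, model, test_id, score FROM results "
        "ORDER BY run_id, model, test_id"
    ).fetchall()

    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "results.json"
    tmp = out_dir / "results.json.tmp"

    # Write to a temporary file first, then rename, so a reader never
    # sees a half-written JSON file.
    tmp.write_text(json.dumps([dict(r) for r in rows], indent=2))
    tmp.replace(target)
    return target
```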

CLI Tool Prerequisites

The benchmark engine calls models through their official CLI tools. The claude, codex, and agent CLIs listed under Models Tracked must be installed and authenticated before a run.

Deployment

Deployment uses a blue-green strategy with zero-downtime swaps:

# Requires .deploy.env with server credentials (not checked into git)
bash deploy.sh

Automated Daily Runs

The full pipeline (benchmark, export, build, deploy) runs via cron on the DGX:

0 3 * * * /path/to/benchmark/run_benchmarks.sh

How Scoring Works

  1. Each test produces a raw score (0-100) via sandbox execution, LLM-as-judge, or exact match
  2. Scores are averaged per category (3 tests each)
  3. Category averages are combined into a composite score (equal weight)
  4. Failed calls, timeouts, exceptions, and other null test results count as 0 rather than being dropped
  5. Composite scores are ranked across models
  6. Regression detection compares against a rolling window of previous runs
  7. Regressions are classified by severity: minor (>5% drop), moderate (>10%), major (>20%)
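
The scoring steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in benchmark/scoring.py: the function names are hypothetical, and the drop is assumed to be measured relative to the mean of the rolling window, which the steps above do not specify.

```python
from statistics import mean
from typing import Dict, List, Optional


def category_averages(results: Dict[str, List[Optional[float]]]) -> Dict[str, float]:
    """Average the raw scores per category; None (failed test) counts as 0."""
    return {
        cat: mean(0.0 if s is None else s for s in scores)
        for cat, scores in results.items()
    }


def composite(cat_avgs: Dict[str, float]) -> float:
    """Equal-weight composite across categories."""
    return mean(cat_avgs.values())


def rank(composites: Dict[str, float]) -> List[str]:
    """Model names ordered from highest to lowest composite score."""
    return sorted(composites, key=composites.get, reverse=True)


def classify_regression(current: float, history: List[float]) -> Optional[str]:
    """Severity of a drop versus the rolling-window mean (assumed baseline)."""
    if not history:
        return None
    baseline = mean(history)
    if baseline <= 0:
        return None
    drop_pct = (baseline - current) / baseline * 100
    if drop_pct > 20:
        return "major"
    if drop_pct > 10:
        return "moderate"
    if drop_pct > 5:
        return "minor"
    return None
```

For example, a category with raw scores 90, 60, and one timed-out test averages to 50 rather than 75, because the failed test counts as 0 instead of being dropped.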

Contributing

Found a bug? Have an idea for a new test category? Open an issue or submit a PR.

When adding new tests:

  1. Create a new test class in benchmark/tests/ extending BaseTest
  2. Add the test definition to benchmark/config.py
  3. The runner and exporter pick it up automatically
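
A new test might look something like the sketch below. The `BaseTest` defined here is only a stand-in so the example runs on its own; the real base class in benchmark/tests/base.py is not shown in this README, and its method names and attributes may differ. The test itself (`FizzBuzzExactMatch`) is hypothetical.

```python
from abc import ABC, abstractmethod


class BaseTest(ABC):
    """Stand-in for benchmark/tests/base.py; the real interface may differ."""

    test_id: str
    category: str

    @abstractmethod
    def prompt(self) -> str:
        """Text sent to the model."""

    @abstractmethod
    def score(self, output: str) -> float:
        """Return a raw score between 0 and 100 for the model's output."""


class FizzBuzzExactMatch(BaseTest):
    """Hypothetical exact-match test: output must equal the expected lines."""

    test_id = "fizzbuzz_exact"
    category = "instruction_following"

    def prompt(self) -> str:
        return "Print FizzBuzz for 1 through 5, one value per line, with no other text."

    def score(self, output: str) -> float:
        expected = ["1", "2", "Fizz", "4", "Buzz"]
        got = [line.strip() for line in output.strip().splitlines()]
        return 100.0 if got == expected else 0.0
```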

Acknowledgments

  • Randy Blasik for inspiring this project and suggesting independent, automated model regression tracking
  • Ed Skoudis for the idea of checking whether a model has regressed before using it each day

License

MIT
