bench(skill-evals): epochs + 95% CIs, baseline arm, fix errored-run scoring #16

Merged: christopherarter merged 1 commit into main from eval/p0-stats-baseline on Jun 14, 2026
@christopherarter

Why

A bench audit found that the skill-eval harness (bench/run_skill_evals.py) reports numbers that can't be trusted or compared, for three reasons. Every eval ran with n=1, so stddev was structurally 0 and there were no error bars. There was no control arm (only with_skill), so nothing attributed an effect to the skill. And a grader/infra failure was scored as 0% (pass_rate or 0.0), polluting the suite mean. This PR contains the three P0 fixes.

What

| Fix | Before | After |
| --- | --- | --- |
| Errored-run scoring | grader failure → pass_rate coerced to 0.0, indistinguishable from a real 0% | errored runs excluded from rates + counted (errored_runs/errored_epochs); a measurement is None, never 0.0 |
| Epochs + CIs | n=1 point estimate, stddev always 0 | --epochs N → mean [95% CI] via two-stage aggregation (reduce epochs per eval, then cluster across evals so N epochs don't pose as N independent samples), per Anthropic's Adding Error Bars to Evals |
| Baseline arm | with_skill only | --baseline runs a without_skill control (Skill tool disabled) → paired per-eval pass_rate delta, the skill-creator skill-vs-no-skill signal |
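
To make the two-stage aggregation concrete, here is a minimal sketch, not the harness code. The names `Estimate`, `sketch_mean_ci`, and `sketch_aggregate` are hypothetical, and the 1.96 normal-approximation multiplier is an assumption about how the 95% interval is formed. Errored epochs are modeled as `None` and dropped before any averaging.

```python
import math
from statistics import mean, stdev
from typing import NamedTuple, Optional, Sequence

Z_95 = 1.96  # assumption: normal-approximation critical value for a 95% CI


class Estimate(NamedTuple):
    mean: float
    low: Optional[float]   # None when n == 1 (no spread information)
    high: Optional[float]
    n: int


def sketch_mean_ci(values: Sequence[float]) -> Optional[Estimate]:
    """Mean with a 95% CI, or None when there is nothing to measure."""
    n = len(values)
    if n == 0:
        return None
    m = mean(values)
    if n == 1:
        return Estimate(m, None, None, 1)
    half_width = Z_95 * stdev(values) / math.sqrt(n)
    return Estimate(m, m - half_width, m + half_width, n)


def sketch_aggregate(
    epochs_by_eval: dict[str, list[Optional[float]]],
) -> tuple[Optional[Estimate], int]:
    """Stage 1: reduce epochs per eval. Stage 2: one data point per eval."""
    per_eval_means = []
    errored_epochs = 0
    for epochs in epochs_by_eval.values():
        scored = [e for e in epochs if e is not None]  # exclude errored epochs
        errored_epochs += len(epochs) - len(scored)
        if scored:
            per_eval_means.append(mean(scored))
    # The suite CI uses n = number of evals, not number of epochs.
    return sketch_mean_ci(per_eval_means), errored_epochs
```

With two evals of three epochs each, the suite estimate has n=2 rather than n=6, which is the point of clustering on eval.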

Also: fixes a latent skill_path aliasing bug in the quality-grader block, stamps git SHA + grader model into benchmark.json metadata for provenance, and rewrites benchmark.md (per-eval CIs, suite-clustered summary, a "Skill effect" block under --baseline).

Example benchmark.md (synthetic, one epoch errored to show exclusion):

| eval_id | name | config | pass_rate | epochs | errored |
| 1 | add-rule | with_skill | 0.93 [0.79, 1.07] (n=2) | 3 | 1 |   <- errored run excluded, not scored 0
...
## Skill effect (with_skill − without_skill, paired on eval)
- pass_rate uplift: +0.39 [0.28, 0.51] (n=2)
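
The "Skill effect" figure above is a paired statistic: each eval contributes one with_skill minus without_skill difference, and the interval is computed over those differences. A minimal sketch under assumptions follows; `sketch_paired_uplift` is a hypothetical name, the 1.96 multiplier is assumed, and evals with a missing (`None`) rate in either arm are skipped rather than scored 0.

```python
import math
from statistics import mean, stdev
from typing import Optional


def sketch_paired_uplift(
    with_skill: dict[str, Optional[float]],
    without_skill: dict[str, Optional[float]],
) -> Optional[tuple[float, Optional[float], Optional[float], int]]:
    """Return (mean_delta, low, high, n) over evals measured in both arms."""
    deltas = []
    for eval_id, treated in with_skill.items():
        control = without_skill.get(eval_id)
        if treated is None or control is None:
            continue  # unpaired or errored: excluded, never treated as 0.0
        deltas.append(treated - control)

    n = len(deltas)
    if n == 0:
        return None
    m = mean(deltas)
    if n == 1:
        return (m, None, None, 1)  # one pair gives no spread estimate
    half_width = 1.96 * stdev(deltas) / math.sqrt(n)  # assumed z for 95%
    return (m, m - half_width, m + half_width, n)
```

Pairing on eval removes between-eval difficulty from the comparison, so the interval reflects variation in the skill's effect rather than variation in how hard each eval is.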

Defaults are backward-compatible (--epochs 1, no baseline).

Tests

New tests/test_skill_evals.py (18 tests, written test-first) covers mean_ci, execution_plan, indexed _seed_workspace, and aggregate — including the regression that an errored run must not score 0% and that the suite clusters on eval. bench/ is added to pytest pythonpath so the (previously untested) harness is importable. Perf-bench suite unchanged and green; CI-pinned ruff 0.15.12 clean; untouched functions byte-identical to main.
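
For illustration, the errored-run regression could be exercised along these lines. This is a hedged sketch, not a copy of tests/test_skill_evals.py: `coerced_suite_mean` and `excluding_suite_mean` are hypothetical helpers that contrast the old `pass_rate or 0.0` coercion with exclusion.

```python
import math
from statistics import mean
from typing import Optional


def coerced_suite_mean(rates: list[Optional[float]]) -> float:
    """Old behavior: a missing rate is coerced to 0.0 before averaging."""
    return mean((r or 0.0) for r in rates)


def excluding_suite_mean(rates: list[Optional[float]]) -> Optional[float]:
    """New behavior: missing rates are dropped; nothing scored means None."""
    scored = [r for r in rates if r is not None]
    return mean(scored) if scored else None


def test_errored_run_is_not_scored_zero():
    rates = [1.0, None, 0.8]  # middle run errored (no grading result)
    assert math.isclose(coerced_suite_mean(rates), 0.6)    # polluted mean
    assert math.isclose(excluding_suite_mean(rates), 0.9)  # errored run excluded


def test_real_zero_is_kept_and_all_errored_is_none():
    assert excluding_suite_mean([0.0, None]) == 0.0  # a genuine 0% still counts
    assert excluding_suite_mean([None, None]) is None
```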

Scope

Skill-eval harness only. The perf-bench iterations=5/p95 weakness and the P1 roadmap (judge validation, held-out split, CI gating, a precision/recall/FP detection eval) are follow-ups, not in this PR.

🤖 Generated with Claude Code

Three P0 fixes from the bench audit so the skill-eval numbers can be
trusted and compared:

- Multi-epoch sampling (--epochs N): every pass_rate/quality figure now
  reports mean [95% CI] via two-stage aggregation (epochs reduced per
  eval, then clustered across evals so N epochs don't pose as N
  independent samples), replacing the n=1 point estimate whose stddev
  was structurally 0. Follows Anthropic's "Adding Error Bars to Evals".

- Control arm (--baseline): runs a without_skill arm (Skill tool
  disabled) and reports a paired per-eval pass_rate delta -- the skill's
  attributable effect, the skill-creator skill-vs-no-skill signal.

- Errored runs (grader failures, no grading.json) are now excluded from
  rates and counted separately instead of being coerced to 0.0, which
  previously made an infra failure indistinguishable from a 0% score and
  polluted the suite mean.

Pure logic (mean_ci, execution_plan, aggregate, indexed _seed_workspace)
is covered by tests/test_skill_evals.py (18 tests); bench/ added to pytest
pythonpath so the harness is importable. Git SHA + grader model stamped
into benchmark metadata for provenance. Also fixes a latent skill_path
aliasing bug in the quality-grader block. Docs updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@christopherarter christopherarter merged commit b46147e into main Jun 14, 2026
2 checks passed
@christopherarter christopherarter deleted the eval/p0-stats-baseline branch June 14, 2026 02:18