bench(skill-evals): epochs + 95% CIs, baseline arm, fix errored-run scoring by christopherarter · Pull Request #16 · dynamik-dev/bully

christopherarter · 2026-06-09T18:17:58Z

Why

A bench audit found that the skill-eval harness (bench/run_skill_evals.py) reports numbers that can't be trusted or compared, for three reasons. Every eval ran at n=1, so stddev was structurally 0 and there were no error bars. There was no control arm (only with_skill), so nothing attributed an effect to the skill. And a grader or infra failure was scored as 0% (`pass_rate or 0.0`), polluting the suite mean. This PR makes the three P0 fixes.

What

| Fix | Before | After |
| --- | --- | --- |
| Errored-run scoring | grader failure → `pass_rate` coerced to `0.0`, indistinguishable from a real 0% | errored runs excluded from rates and counted (`errored_runs`/`errored_epochs`); a measurement is `None`, never `0.0` |
| Epochs + CIs | `n=1` point estimate, `stddev` always `0` | `--epochs N` → mean [95% CI] via two-stage aggregation (reduce epochs per eval, then cluster across evals so N epochs don't pose as N independent samples), per Anthropic's Adding Error Bars to Evals |
| Baseline arm | `with_skill` only | `--baseline` runs a `without_skill` control (Skill tool disabled) → paired per-eval pass_rate delta, the skill-creator skill-vs-no-skill signal |
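To make the first two rows concrete, here is a minimal sketch of two-stage aggregation with errored runs excluded. It is not the code in bench/run_skill_evals.py: the names `mean_ci_sketch` and `aggregate_sketch`, the input shape (eval id mapped to a list of per-epoch pass rates, with `None` marking an errored epoch), and the choice of a normal-approximation interval (z = 1.96) are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev

Z_95 = 1.96  # two-sided 95% quantile of the standard normal (assumed interval type)


def mean_ci_sketch(values):
    """Mean and normal-approximation 95% CI over the non-None values.

    Returns (mean, lo, hi, n). lo/hi are None when n < 2, and the whole
    result is None when nothing was measured (never 0.0).
    """
    kept = [v for v in values if v is not None]
    n = len(kept)
    if n == 0:
        return None
    m = mean(kept)
    if n == 1:
        return (m, None, None, 1)
    half = Z_95 * stdev(kept) / math.sqrt(n)
    return (m, m - half, m + half, n)


def aggregate_sketch(runs):
    """Two-stage aggregation over a hypothetical {eval_id: [pass_rate | None, ...]}.

    Stage 1 reduces the epochs of each eval to one mean. Stage 2 computes the
    suite interval over those per-eval means, so the sample size is the number
    of evals, not the number of epochs.
    """
    per_eval = {}
    errored_epochs = 0
    for eval_id, epochs in runs.items():
        errored_epochs += sum(1 for v in epochs if v is None)
        per_eval[eval_id] = mean_ci_sketch(epochs)
    eval_means = [r[0] if r is not None else None for r in per_eval.values()]
    return {
        "per_eval": per_eval,
        "suite": mean_ci_sketch(eval_means),
        "errored_epochs": errored_epochs,
    }
```

Called with `{1: [1.0, None, 6 / 7], 2: [0.5, 0.5, 0.5]}`, the errored epoch of eval 1 is counted once and left out of its mean, and the suite interval is computed over two per-eval means. A normal-approximation interval is not clamped to [0, 1], so at small n the upper bound can exceed 1.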

Also: fixes a latent skill_path aliasing bug in the quality-grader block, stamps git SHA + grader model into benchmark.json metadata for provenance, and rewrites benchmark.md (per-eval CIs, suite-clustered summary, a "Skill effect" block under --baseline).
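The provenance stamp could look roughly like the sketch below. The metadata key names (`git_sha`, `grader_model`), the helper names, and the fallback to `None` outside a git checkout are assumptions; the PR only states that the git SHA and grader model are written into the benchmark.json metadata.

```python
import subprocess


def git_sha_or_none(cwd=None):
    """Return the current commit SHA, or None if git is unavailable or fails."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip() or None


def stamp_metadata(benchmark, grader_model, cwd=None):
    """Return a copy of `benchmark` with provenance fields under "metadata".

    The key names here are hypothetical, chosen only for this sketch.
    """
    stamped = dict(benchmark)
    meta = dict(stamped.get("metadata", {}))
    meta["git_sha"] = git_sha_or_none(cwd)
    meta["grader_model"] = grader_model
    stamped["metadata"] = meta
    return stamped
```

Returning a copy rather than mutating the input keeps the stamping step easy to test in isolation.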

Example benchmark.md (synthetic, one epoch errored to show exclusion):

| eval_id | name | config | pass_rate | epochs | errored |
| 1 | add-rule | with_skill | 0.93 [0.79, 1.07] (n=2) | 3 | 1 |   <- errored run excluded, not scored 0
...
## Skill effect (with_skill − without_skill, paired on eval)
- pass_rate uplift: +0.39 [0.28, 0.51] (n=2)
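A paired delta like the one in the "Skill effect" block can be sketched as follows. The function name, the input shape (one pass rate per eval for each arm, `None` when that arm has no usable measurement), and the z = 1.96 interval are assumptions for illustration, not the harness's actual code.

```python
import math
from statistics import mean, stdev


def paired_delta_sketch(with_skill, without_skill):
    """Mean per-eval difference (with_skill - without_skill) with a 95% CI.

    Both arguments are hypothetical {eval_id: pass_rate | None} mappings.
    An eval contributes only if both arms produced a measurement, so an
    errored arm drops the pair instead of counting as 0.
    Returns (mean_delta, lo, hi, n), or None when no pair is usable.
    """
    deltas = []
    for eval_id, treated in with_skill.items():
        control = without_skill.get(eval_id)
        if treated is None or control is None:
            continue
        deltas.append(treated - control)
    n = len(deltas)
    if n == 0:
        return None
    m = mean(deltas)
    if n == 1:
        return (m, None, None, 1)
    half = 1.96 * stdev(deltas) / math.sqrt(n)
    return (m, m - half, m + half, n)
```

Pairing on eval means each difference compares the same task under both arms, so between-eval difficulty cancels out of the delta rather than inflating its variance.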

Defaults are backward-compatible (--epochs 1, no baseline).
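One way the two flags could expand into a run plan is sketched below. Only the flag names, their defaults, and the arm names `with_skill`/`without_skill` come from this PR; the parser layout, the `execution_plan_sketch` name, and the tuple shape are hypothetical and may differ from the real `execution_plan`.

```python
import argparse


def build_parser():
    """Parser exposing only the two flags described in this PR."""
    parser = argparse.ArgumentParser(description="skill-eval flags (sketch)")
    parser.add_argument("--epochs", type=int, default=1,
                        help="number of epochs to run per eval")
    parser.add_argument("--baseline", action="store_true",
                        help="also run the without_skill control arm")
    return parser


def execution_plan_sketch(eval_ids, epochs, baseline):
    """List every (eval_id, config, epoch) run implied by the flags."""
    configs = ["with_skill"]
    if baseline:
        configs.append("without_skill")
    return [
        (eval_id, config, epoch)
        for eval_id in eval_ids
        for config in configs
        for epoch in range(epochs)
    ]
```

With no flags the plan is one `with_skill` run per eval, which is the pre-existing behavior; `--epochs 3 --baseline` over two evals yields 12 runs.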

Tests

New tests/test_skill_evals.py (18 tests, written test-first) covers mean_ci, execution_plan, indexed _seed_workspace, and aggregate — including the regression that an errored run must not score 0% and that the suite clusters on eval. bench/ is added to pytest pythonpath so the (previously untested) harness is importable. Perf-bench suite unchanged and green; CI-pinned ruff 0.15.12 clean; untouched functions byte-identical to main.

Scope

Skill-eval harness only. The perf-bench iterations=5/p95 weakness and the P1 roadmap (judge validation, held-out split, CI gating, a precision/recall/FP detection eval) are follow-ups, not in this PR.

🤖 Generated with Claude Code

Three P0 fixes from the bench audit so the skill-eval numbers can be trusted and compared:

- Multi-epoch sampling (--epochs N): every pass_rate/quality figure now reports mean [95% CI] via two-stage aggregation (epochs reduced per eval, then clustered across evals so N epochs don't pose as N independent samples), replacing the n=1 point estimate whose stddev was structurally 0. Follows Anthropic's "Adding Error Bars to Evals".
- Control arm (--baseline): runs a without_skill arm (Skill tool disabled) and reports a paired per-eval pass_rate delta -- the skill's attributable effect, the skill-creator skill-vs-no-skill signal.
- Errored runs (grader failures, no grading.json) are now excluded from rates and counted separately instead of being coerced to 0.0, which previously made an infra failure indistinguishable from a 0% score and polluted the suite mean.

Pure logic (mean_ci, execution_plan, aggregate, indexed _seed_workspace) is covered by tests/test_skill_evals.py (18 tests); bench/ added to pytest pythonpath so the harness is importable.

Git SHA + grader model stamped into benchmark metadata for provenance. Also fixes a latent skill_path aliasing bug in the quality-grader block. Docs updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

christopherarter merged commit b46147e into main Jun 14, 2026
2 checks passed

christopherarter deleted the eval/p0-stats-baseline branch June 14, 2026 02:18

christopherarter merged 1 commit into main from eval/p0-stats-baseline
