bench(skill-evals): epochs + 95% CIs, baseline arm, fix errored-run scoring #16
Merged
Three P0 fixes from the bench audit so the skill-eval numbers can be trusted and compared:

- Multi-epoch sampling (--epochs N): every pass_rate/quality figure now reports mean [95% CI] via two-stage aggregation (epochs reduced per eval, then clustered across evals so N epochs don't pose as N independent samples), replacing the n=1 point estimate whose stddev was structurally 0. Follows Anthropic's "Adding Error Bars to Evals".
- Control arm (--baseline): runs a without_skill arm (Skill tool disabled) and reports a paired per-eval pass_rate delta -- the skill's attributable effect, the skill-creator skill-vs-no-skill signal.
- Errored runs (grader failures, no grading.json) are now excluded from rates and counted separately instead of being coerced to 0.0, which previously made an infra failure indistinguishable from a 0% score and polluted the suite mean.

Pure logic (mean_ci, execution_plan, aggregate, indexed _seed_workspace) is covered by tests/test_skill_evals.py (18 tests); bench/ added to pytest pythonpath so the harness is importable. Git SHA + grader model stamped into benchmark metadata for provenance. Also fixes a latent skill_path aliasing bug in the quality-grader block. Docs updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
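For readers who want the statistics spelled out, here is a minimal sketch of the two-stage aggregation, the errored-run exclusion, and the paired baseline delta described above. It is not the harness code: only the names `mean_ci` and `aggregate` come from this PR, while the function bodies, the input shape (a dict of per-epoch pass rates with `None` for an errored epoch), the helper `paired_delta`, and the 1.96 normal-approximation multiplier are all assumptions made for illustration.

```python
from __future__ import annotations

import math
from statistics import mean, stdev
from typing import Optional, Sequence

Z_95 = 1.96  # assumption: normal-approximation critical value for a 95% interval


def mean_ci(values: Sequence[float]) -> tuple[Optional[float], Optional[tuple[float, float]]]:
    """Return (mean, (low, high)); a part is None when it cannot be computed."""
    if not values:
        return None, None  # nothing measured: report None, never 0.0
    m = mean(values)
    if len(values) < 2:
        return m, None  # a single sample carries no spread information
    half_width = Z_95 * stdev(values) / math.sqrt(len(values))
    return m, (m - half_width, m + half_width)


def aggregate(runs: dict[str, list[Optional[float]]]) -> dict:
    """Two-stage aggregation of {eval name: [pass_rate per epoch, None if errored]}."""
    per_eval: dict[str, float] = {}
    errored_epochs = 0
    for name, epochs in runs.items():
        valid = [rate for rate in epochs if rate is not None]
        errored_epochs += len(epochs) - len(valid)  # counted, not scored as 0%
        if valid:
            # Stage 1: reduce the epochs of one eval to a single number.
            per_eval[name] = mean(valid)
    # Stage 2: the eval is the unit of analysis, so the interval is computed
    # over per-eval means rather than over every (eval, epoch) pair.
    suite_mean, suite_ci = mean_ci(list(per_eval.values()))
    return {
        "per_eval": per_eval,
        "mean": suite_mean,
        "ci": suite_ci,
        "errored_epochs": errored_epochs,
    }


def paired_delta(with_skill: dict[str, float], without_skill: dict[str, float]):
    """Mean and interval of per-eval (with_skill - without_skill) differences."""
    shared = sorted(with_skill.keys() & without_skill.keys())
    return mean_ci([with_skill[name] - without_skill[name] for name in shared])


if __name__ == "__main__":
    # Synthetic numbers, chosen only to exercise the errored-epoch path.
    with_arm = aggregate({"a": [1.0, 1.0, None], "b": [0.5, 0.5, 0.5]})
    without_arm = aggregate({"a": [0.5, 0.5, 0.5], "b": [0.5, 0.5, 0.5]})
    print(with_arm)
    print(paired_delta(with_arm["per_eval"], without_arm["per_eval"]))
```

In the synthetic run, eval `a` reduces to 1.0 over its two valid epochs; coercing the errored epoch to 0.0 would have pulled it down to about 0.667. Because the interval uses a plain normal approximation, its bounds can fall outside [0, 1] when there are few evals; a real harness might prefer a t-based or bootstrap interval.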
Why
A bench audit found that the skill-eval harness (`bench/run_skill_evals.py`) reports numbers that can't be trusted or compared: every eval ran n=1 (so `stddev` was structurally `0`, with no error bars), there was no control arm (only `with_skill`, so nothing attributed an effect to the skill), and a grader/infra failure was scored as 0% (`pass_rate or 0.0`), polluting the suite mean. These are the three P0 fixes.

What
| Before | After |
| --- | --- |
| `pass_rate` coerced to `0.0`, indistinguishable from a real 0% | Errored runs excluded from rates and counted separately (`errored_runs`/`errored_epochs`); a measurement is `None`, never `0.0` |
| `n=1` point estimate, `stddev` always `0` | `--epochs N` → mean [95% CI] via two-stage aggregation (reduce epochs per eval, then cluster across evals so N epochs don't pose as N independent samples), per Anthropic's "Adding Error Bars to Evals" |
| `with_skill` only | `--baseline` runs a `without_skill` control (Skill tool disabled) → paired per-eval `pass_rate` delta, the skill-creator skill-vs-no-skill signal |

Also: fixes a latent `skill_path` aliasing bug in the quality-grader block, stamps git SHA + grader model into `benchmark.json` metadata for provenance, and rewrites `benchmark.md` (per-eval CIs, suite-clustered summary, a "Skill effect" block under `--baseline`).

Example
`benchmark.md` (synthetic, one epoch errored to show exclusion):

Defaults are backward-compatible (`--epochs 1`, no baseline).

Tests
New `tests/test_skill_evals.py` (18 tests, written test-first) covers `mean_ci`, `execution_plan`, indexed `_seed_workspace`, and `aggregate`, including the regression that an errored run must not score 0% and that the suite clusters on eval. `bench/` is added to pytest `pythonpath` so the (previously untested) harness is importable. Perf-bench suite unchanged and green; CI-pinned ruff 0.15.12 clean; untouched functions byte-identical to `main`.

Scope
Skill-eval harness only. The perf-bench `iterations=5`/p95 weakness and the P1 roadmap (judge validation, held-out split, CI gating, a precision/recall/FP detection eval) are follow-ups, not in this PR.

🤖 Generated with Claude Code