A benchmark you'll actually run, and a retry that stopped redoing everything

Shipped

The resume skill tailors a résumé to a job, and v0.2.0 gives it two things it was missing: a benchmark you can trust, and a retry that stops wasting time. npm run benchmark runs the pipeline against seven real jobs and scores the output, with the judge running through the Claude CLI so it’s intended to run under a Pro or Max subscription rather than a metered per-run API key, and the gate looking only at the treatment so a noisy baseline can’t move the verdict. (More on that “intended to” below: programmatic CLI use can still bill per token, so it’s worth confirming.) Separately, the slow retry path went from about 39 seconds to about 5.3 seconds by retrying only the summary instead of the whole thing.

Both changes are the same idea wearing two hats: make the feedback loop cheap enough that you actually use it. The part worth writing about is how the benchmark is built so it survives daily use, and how the retry got fast without changing its output. Here is the end-to-end build, from the fixtures and rubric through the gate that fails the build on a regression.

Setup: the fixtures and the rubric

A benchmark needs two things before any code runs: a set of cases that look like real work, and a definition of what “good” means on each case. Anthropic’s guidance is to build evals around what actually matters and to keep the cases representative rather than convenient (Anthropic: Demystifying evals for AI agents). For this skill a case is one real job description paired with the base résumé, and “good” is a small scoring rubric the judge applies to the tailored output.

The fixtures live on disk so they’re easy to add to and easy to read. Each one is a directory with the job and the résumé it should be tailored from.

# benchmark/fixtures.py
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Fixture:
    key: str
    job: str       # the job description text
    resume: str    # the base résumé to tailor from

FIXTURE_DIR = Path(__file__).parent / "fixtures"

def load_fixtures() -> list[Fixture]:
    """Each fixture is a real job+resume pair under benchmark/fixtures/<key>/."""
    fixtures = []
    for d in sorted(p for p in FIXTURE_DIR.iterdir() if p.is_dir()):
        fixtures.append(Fixture(
            key=d.name,
            job=(d / "job.md").read_text(),
            resume=(d / "resume.md").read_text(),
        ))
    return fixtures

The rubric is the part that turns a vibe into a number. Spelling the dimensions out, and weighting them, is what makes two runs comparable. Truthfulness carries real weight here because a résumé that invents experience is worse than one that’s merely generic.

# benchmark/rubric.py
RUBRIC = """
Score the TAILORED résumé against the JOB, 1-5 on each criterion:

- relevance:    do the bullets surface the experience the job asks for?
- specificity:  concrete tools and outcomes, not generic filler?
- truthfulness: is every claim traceable to the base résumé (nothing invented)?
- coverage:     are the job's must-have requirements addressed?

Return ONLY JSON: {"relevance":N,"specificity":N,"truthfulness":N,"coverage":N}
"""

WEIGHTS = {"relevance": 0.35, "specificity": 0.20, "truthfulness": 0.30, "coverage": 0.15}

def weighted_score(scores: dict[str, int]) -> float:
    """Collapse the rubric dimensions into one 1-5 number."""
    return sum(scores[dim] * w for dim, w in WEIGHTS.items())

Build: a judge that runs through the Claude CLI

The judge is an LLM grading the output against the rubric, which is a standard way to score open-ended text that has no single correct answer (Anthropic: Demystifying evals for AI agents). The design choice that matters is where it runs. Calling the metered API for every fixture on every change adds a small bill to a thing I want to run constantly, and a cost per run is exactly the friction that turns “run the benchmark” into “skip the benchmark this time.” Running the judge through the Claude CLI is intended to run under a Claude Pro or Max subscription rather than a metered per-run API key.

Treat that as the goal, not a guarantee. Programmatic claude -p use has documented cases of per-token billing even with an active subscription over OAuth and no ANTHROPIC_API_KEY set (claude-code #43333, #37686); and if the CLI is authenticated with an ANTHROPIC_API_KEY it is metered for certain. Before assuming a run costs nothing, confirm it: claude -p --output-format json is the documented programmatic path, and its envelope reports a total_cost_usd you can read. That JSON envelope still puts the model’s reply in a free-text .result field, so the defensive extraction below applies either way.

The other thing the CLI forces you to handle is the output. claude -p returns the model’s plain text reply, not guaranteed-bare JSON: it may wrap the object in a fenced JSON code block or a sentence of preamble, so json.loads(proc.stdout) will intermittently raise. So the judge extracts the JSON defensively, pulling the outermost {...} span out of whatever the model returned before parsing it.

# benchmark/judge.py
import json, re, subprocess
from benchmark.rubric import RUBRIC, WEIGHTS, weighted_score

def _extract_scores(text: str) -> dict:
    """`claude -p` returns free text, not guaranteed-bare JSON: the reply may carry
    a fenced JSON block or a line of preamble. Pull out the outermost {...} span and parse it."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError(f"no JSON object in judge reply: {text!r}")
    scores = json.loads(match.group(0))
    missing = WEIGHTS.keys() - scores.keys()   # every rubric dimension must be present
    if missing:
        raise ValueError(f"judge reply missing rubric keys {sorted(missing)}: {scores!r}")
    return scores

def judge(job: str, tailored: str) -> float:
    """Score one tailored résumé via the Claude CLI. Intended to run under a Claude
    Pro/Max subscription, but programmatic -p use can still bill per token; confirm
    actual cost with `--output-format json`'s total_cost_usd rather than assuming zero."""
    prompt = f"{RUBRIC}\n\nJOB:\n{job}\n\nTAILORED RESUME:\n{tailored}"
    proc = subprocess.run(
        ["claude", "-p", prompt],          # CLI invocation, not the metered API
        capture_output=True, text=True, check=True,
    )
    scores = _extract_scores(proc.stdout)  # extract JSON defensively; the reply isn't bare JSON
    return weighted_score(scores)

The runner ties it together: load the fixtures, tailor each one, judge the result, and report the mean. The thing being measured is the tailored output, the treatment, and nothing else is scored.

# benchmark/run.py
import statistics
import sys
from benchmark.fixtures import load_fixtures
from benchmark.judge import judge
from resume.pipeline import tailor        # the thing under test

THRESHOLD = 3.5                            # minimum acceptable mean treatment score (1-5)

def run() -> float:
    scores = []
    for fx in load_fixtures():
        tailored = tailor(fx.resume, fx.job).document   # treatment = the tailored output
        score = judge(fx.job, tailored)
        scores.append(score)
        print(f"{fx.key:<24} {score:.2f}")
    mean = statistics.fmean(scores)
    print(f"\ntreatment mean: {mean:.2f} over {len(scores)} fixtures")
    return mean

Gate on the treatment, not the delta

The tempting design is to also tailor with the feature off, score that baseline, and gate on the difference. It reads as more rigorous because it’s a controlled comparison. The problem is that the baseline score is itself a judge call, and a judge call has run-to-run variance. A lucky baseline draw can make a flat change look like a win, and an unlucky one can fail a change that’s actually fine. The verdict ends up partly determined by noise in a number I don’t even care about.

So the gate watches only the treatment, against an absolute threshold. The baseline is still worth logging for context, but it doesn’t get a vote on whether the build passes. Removing the baseline’s variance from the decision is what keeps the gate from crying wolf, and a gate that cries wolf gets ignored, which is the same as having no gate. The whole point of an eval is the repeatable read that gates the next change, not the one-time score (Anthropic: Demystifying evals for AI agents).

Build: a retry that redoes only what changed

The pipeline has two stages. Rewriting each role’s bullets is the slow, expensive part; writing the summary off those bullets is quick. The old retry reran both, which meant a wrong summary cost a full pipeline run, and it sometimes emitted no-op bullets that padded the résumé with lines that said nothing. Retrying only the summary takes the slow case from about 39 seconds to about 5.3 seconds for the same output.

This is incremental computation in miniature: when one input changes, recompute the part that depends on it and reuse the rest instead of rerunning the whole pipeline (Incremental computing). To make the reuse possible, tailor returns its intermediate result, the bullets, rather than only the assembled document, so a retry can hand them straight back.

# resume/pipeline.py
from dataclasses import dataclass

@dataclass
class Tailored:
    bullets: list[str]
    summary: str
    document: str

def tailor(resume: str, job: str) -> Tailored:
    bullets = rewrite_bullets(resume, job)    # ~34s: rewrites every role's bullets
    bullets = [b for b in bullets if b.strip()]   # drop no-op bullets instead of padding
    summary = write_summary(bullets, job)     # ~5s: a short summary derived from the bullets
    return Tailored(bullets, summary, assemble(bullets, summary))

def retry_summary(prev: Tailored, job: str) -> Tailored:
    """The summary was the only thing wrong. Reuse the expensive bullets verbatim."""
    summary = write_summary(prev.bullets, job)          # recompute only the cheap stage
    return Tailored(prev.bullets, summary, assemble(prev.bullets, summary))

rewrite_bullets, write_summary, assemble, and write_pdf are your own pipeline stages here, treated as black boxes; substitute whatever your skill already does to produce bullets, a summary, an assembled document, and a PDF. And for python -m benchmark.run to resolve these imports, benchmark/ and resume/ both need to be importable packages from the repo root, each with an __init__.py.

The bullets are the memoized expensive result. A retry that recomputes them from scratch isn’t a retry, it’s a fresh run wearing a retry label.

Use it

The benchmark gets an npm entry point ("benchmark": "python -m benchmark.run") so it sits next to the rest of the project’s scripts and runs the same command in CI. The subscription auth doesn’t carry into CI, though: a runner has no interactive Pro/Max login, so in practice CI authenticates the CLI with an ANTHROPIC_API_KEY secret (and accepts that those runs are metered) or with a long-lived OAuth token. Local runs lean on the subscription; CI pays the per-token rate, so size the fixture set with that in mind. Running npm run benchmark prints a line per fixture and the mean, then a PASS/FAIL on the threshold:

backend-platform-sre     4.10
data-platform-eng        3.70
staff-devops             4.25
...
treatment mean: 3.95 over 7 fixtures
PASS

The retry is wired in at the call site. Tailor once, check the cheap stage, and if only the summary is off, redo just that.

# resume/cli.py
def summary_is_acceptable(summary: str) -> bool:
    # Reject empties, stubs, and obviously-truncated output; otherwise keep it.
    return bool(summary.strip()) and "TODO" not in summary and len(summary) >= 40

result = tailor(resume, job)
if not summary_is_acceptable(result.summary):
    result = retry_summary(result, job)   # ~5.3s, reuses the bullets from the first run
write_pdf(result.document)

Verify: watch the gate fail

The threshold check is factored into a gate function so a test can call it directly, and main wires it to the process exit code. A mean below the line makes gate return False, which exits non-zero, which fails npm run benchmark and the CI step that calls it.

# benchmark/run.py (same file, continued)

def gate(mean: float) -> bool:
    """The threshold check, factored out so a test can assert on it directly."""
    if mean < THRESHOLD:
        print(f"FAIL: treatment {mean:.2f} < threshold {THRESHOLD}", file=sys.stderr)
        return False
    print("PASS")
    return True

if __name__ == "__main__":
    sys.exit(0 if gate(run()) else 1)      # non-zero exit fails the build

The behavior worth pinning is that a regression actually fails, not just lowers a number nobody reads. Force the judge to return a regressed score and assert the gate trips. Because the gate only reads the treatment, the test needs no baseline; everything else that’s slow or non-deterministic gets stubbed (the fixtures, tailor, and the judge) so the test exercises only the gate’s decision.

# tests/test_gate.py
import benchmark.run as run
from benchmark.fixtures import Fixture
from resume.pipeline import Tailored

def test_gate_fails_when_treatment_regresses(monkeypatch):
    # Stub the slow/non-deterministic pieces so only the gate logic runs.
    monkeypatch.setattr(run, "load_fixtures",
                        lambda: [Fixture(f"fx{i}", "job", "resume") for i in range(3)])
    monkeypatch.setattr(run, "tailor", lambda resume, job: Tailored([], "", ""))
    monkeypatch.setattr(run, "judge", lambda job, out: 2.0)   # simulate a quality regression

    mean = run.run()
    assert mean < run.THRESHOLD     # below the line
    assert run.gate(mean) is False  # so the gate fails the build (main() exits non-zero)

This fails the moment the treatment drops below the threshold, which is the whole job: turn a quiet quality slip into a red build before it ships.

Leaning on the benchmark to catch quality regressions automatically, and using its per-stage timings to justify the next round of speed work rather than guessing where the time goes. That’s the same discipline as profiling before optimizing, pointed at a model pipeline instead of a tight loop (Program optimization). The 39-to-5.3 win came from knowing which stage held the time, and the next one will come the same way.

Sources

Anthropic: Demystifying evals for AI agents — build evals around what matters, grade open-ended output with an LLM judge, and run them in the loop.
Incremental computing — recompute only what changed and reuse the rest instead of rerunning the whole pipeline.
Program optimization (Knuth on premature optimization) — measure before optimizing; let data point at the bottleneck.

Changelog

feat(resume): accuracy + speed benchmark + pipeline improvements (#11) (1262519)