When thinking makes the model slower and worse: turning extended reasoning off for extraction

Shipped

OneTap Resume started as a web app that tailors a résumé to a job posting and renders a PDF. v0.1.1 turns that whole pipeline into a self-contained Claude Code skill, /onetapresume, that runs locally through the claude CLI on your subscription, so there’s no API key and no per-run cost. Along the way it got live streaming progress, results as tables, a native macOS file picker, and an interactive style picker that re-renders across seven templates in about half a second. It also got cleaned up for other people to use: roughly 1,580 lines of dead code swept, an unused dependency dropped (it was pulling 92 transitive packages), an MIT license, packaging, a rewritten README, and a release workflow that tags and publishes a GitHub Release on merge to master.

The headline, though, was speed. A real run took five to nine minutes and sometimes hung. The fix wasn’t a faster machine or a cleverer prompt; it was noticing that the model was doing a lot of expensive work the task never asked for. What follows is the whole arc end to end: how to find the wasted work, how to build the extraction call so it stops happening, how to replace the in-prompt self-audit with deterministic code, and how to prove the change on evidence rather than vibes.

Diagnose: 800 characters of answer, 16,000 tokens of bill

The first job is to make the waste visible. Extended thinking is billed as output tokens, so if you instrument a call and compare the size of the visible answer to the billed output_tokens, a large gap is the tell. A four-bullet résumé produces about 800 characters of JSON; that is a few hundred tokens of answer, not sixteen thousand.

from anthropic import Anthropic

client = Anthropic()

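# `prompt` is the full tailoring prompt (job posting plus résumé bullets),
# assumed to be built elsewhere before this call.
#
# Haiku 4.5 uses the classic extended-thinking shape: {type: "enabled",
# budget_tokens: N}. budget_tokens must be < max_tokens, and output_tokens
# is thinking + visible text, so to be billed ~16k the budget has to allow it.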
resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=20000,
    thinking={"type": "enabled", "budget_tokens": 16000},  # the suspect
    system="Rewrite each résumé bullet for the target job. Return ONLY a JSON array of strings.",
    messages=[{"role": "user", "content": prompt}],
)

answer = "".join(b.text for b in resp.content if b.type == "text")
print(f"answer chars:  {len(answer)}")              # ~800
print(f"output tokens: {resp.usage.output_tokens}") # ~16,000
# ~800 chars of JSON is a few hundred tokens, not 16k. The gap is thinking:
# tokens you pay for and wait on but never see in the answer.

That gap was the two missing minutes. The model spent them reasoning through a rule-dense prompt before emitting a tiny structured answer. Anthropic’s own guidance draws the line exactly here: thinking helps on tasks that need multi-step reasoning, math, complex coding, or multi-constraint analysis, and for everyday work like reformatting or extraction, standard responses are faster and perfectly sufficient (Anthropic: Extended thinking). Their summary is blunt: when in doubt, respond directly (Anthropic: Adaptive thinking). That call is automatic on the 4.6-and-later adaptive models, which decide per turn whether to think; Haiku 4.5 doesn’t, so you make the same call by hand with budget_tokens or {"type": "disabled"}. Tailoring a bullet into structured JSON is extraction with rules, not open-ended reasoning. The prompt was dense, so the model kept thinking, but density of instructions is not the same as difficulty of reasoning, and that gap is where the wasted time lived.
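To put a number on that gap, a back-of-the-envelope estimate is enough. The sketch below is not part of the skill; `thinking_overhead` is a hypothetical helper, and the four-characters-per-token ratio is a rough assumption rather than a real tokenizer, so read the result as an order of magnitude.

```python
def thinking_overhead(answer_chars: int, output_tokens: int,
                      chars_per_token: float = 4.0) -> dict[str, float]:
    """Estimate how much of the billed output was not visible answer.

    chars_per_token is an assumed average, not a tokenizer. JSON often
    tokenizes less efficiently, which only makes the visible answer a
    slightly larger slice of the bill than estimated here.
    """
    answer_tokens = answer_chars / chars_per_token
    hidden_tokens = max(0.0, output_tokens - answer_tokens)
    hidden_share = hidden_tokens / output_tokens if output_tokens else 0.0
    return {
        "answer_tokens": answer_tokens,
        "hidden_tokens": hidden_tokens,
        "hidden_share": hidden_share,
    }

# The figures from the diagnosis: ~800 characters of answer, ~16,000 billed tokens.
est = thinking_overhead(answer_chars=800, output_tokens=16_000)
print(f"visible answer: ~{est['answer_tokens']:.0f} tokens")
print(f"hidden work:    ~{est['hidden_tokens']:.0f} tokens "
      f"({est['hidden_share']:.1%} of the bill)")
```

Under that assumption the visible answer is about 200 tokens, so nearly everything billed on the call was reasoning the task never needed.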

Build the extraction call with thinking off

The fix is to make “don’t think” explicit rather than hoping the model decides not to. For Haiku 4.5 that means passing thinking={"type": "disabled"} instead of leaving the default to chance, and capping max_tokens to the size of the real payload so a runaway answer can’t balloon.
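One way to keep that choice explicit everywhere is a small helper that builds the thinking argument and refuses combinations the API would reject. This is a sketch, not code from the skill: `thinking_param` and `MIN_BUDGET` are hypothetical names, and the 1,024-token floor is my understanding of the documented minimum, so check it against the current docs before relying on it.

```python
MIN_BUDGET = 1024  # assumed minimum thinking budget; verify against current docs

def thinking_param(budget: int, max_tokens: int) -> dict:
    """Build the `thinking` argument for a Haiku 4.5 call.

    budget <= 0 means "answer directly". Any positive budget must leave room
    for the visible answer, so it has to be strictly below max_tokens.
    """
    if budget <= 0:
        return {"type": "disabled"}
    if budget < MIN_BUDGET:
        raise ValueError(f"budget {budget} is below the assumed minimum {MIN_BUDGET}")
    if budget >= max_tokens:
        raise ValueError(f"budget {budget} must be smaller than max_tokens {max_tokens}")
    return {"type": "enabled", "budget_tokens": budget}

print(thinking_param(0, 4160))        # extraction: thinking off
print(thinking_param(4000, 16000))    # a step that genuinely needs to reason
```

The extraction path only ever needs the disabled form; the enabled branch is there for the steps that do branch.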

import json
import re
from anthropic import Anthropic

client = Anthropic()

def _parse_json_array(text: str) -> list[str]:
    """Models wrap JSON in ```json fences or a short preamble, and a truncated
    answer won't parse at all. Strip fences, grab the outermost [...] span
    (first "[" through last "]"), and on failure ask once for the bare array
    (the same repair pattern used below)."""
    def _load(s: str) -> list[str]:
        s = re.sub(r"^```(?:json)?\s*|\s*```$", "", s.strip())
        m = re.search(r"\[.*\]", s, re.DOTALL)
        return json.loads(m.group(0) if m else s)
    try:
        return _load(text)
    except json.JSONDecodeError:
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=4096,
            thinking={"type": "disabled"},
            system="Return ONLY the JSON array of strings from the input. No fences, no prose.",
            messages=[{"role": "user", "content": text}],
        )
        fixed = "".join(b.text for b in resp.content if b.type == "text")
        # One repair only. If the retry still isn't valid JSON, let _load raise:
        # better to fail loud than hand back a half-parsed array.
        return _load(fixed)

def extract_bullets(prompt: str, bullet_count: int = 26) -> list[str]:
    """Structured extraction: rule-dense, but not reasoning-heavy.

    Thinking is turned OFF explicitly. Extraction with rules is not
    open-ended reasoning, so the model should answer directly instead of
    reasoning out loud first. We default to the fast model and reserve the
    expensive reasoning budget for steps that genuinely branch.
    """
    resp = client.messages.create(
        model="claude-haiku-4-5",            # small model is plenty for extraction
        # Cap to the payload, not 1k: a 26-bullet résumé of JSON overflows 1024
        # tokens and the answer gets truncated. Scale the cap by bullet count.
        max_tokens=max(2048, bullet_count * 160),
        thinking={"type": "disabled"},       # make "don't think" explicit
        system=(
            "Rewrite each résumé bullet for the target job. "
            "Return ONLY a JSON array of strings. No preamble, no explanation."
        ),
        messages=[{"role": "user", "content": prompt}],
    )
    text = "".join(b.text for b in resp.content if b.type == "text")
    return _parse_json_array(text)

This is the simplest-thing-first move that good agent design asks for: start with the least machinery that could work, and add complexity only when a simpler version demonstrably falls short (Anthropic: Building effective agents). For this call, disabling thinking dropped one extraction from about 130 seconds to about 10. The same principle carries to the CLI path: if a phase is extraction rather than reasoning, run it on a fast model with thinking off.
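The local half of the parser is worth exercising without touching the network, since it decides when the repair call fires. The sketch below pulls that logic into a standalone function; `load_json_array` and the sample strings are illustrative names and data, not part of the skill, and the type check on the result is an addition of mine.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep the literal out of the source

def load_json_array(text: str) -> list[str]:
    """Strip a surrounding fence, take the outermost [...] span, and parse it."""
    s = re.sub(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", "", text.strip())
    m = re.search(r"\[.*\]", s, re.DOTALL)
    data = json.loads(m.group(0) if m else s)
    # Extra guard (my addition): valid JSON that is not an array of strings
    # is still the wrong shape for this pipeline.
    if not (isinstance(data, list) and all(isinstance(x, str) for x in data)):
        raise ValueError("expected a JSON array of strings")
    return data

fenced = f'{FENCE}json\n["Led a team of 4", "Cut build time 30%"]\n{FENCE}'
chatty = 'Here are the bullets:\n["Led a team of 4"]'
truncated = '["Led a team of 4", "Cut build ti'

for sample in (fenced, chatty):
    print(load_json_array(sample))

try:
    load_json_array(truncated)
except json.JSONDecodeError as exc:
    # This is the case that would trigger the single repair request.
    print("needs the repair call:", exc.msg)
```

Fenced and chatty answers parse locally for free; only a truncated or otherwise malformed answer costs a second request.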

Move the self-audit into deterministic code

There was a second reason the old prompt was thinking so hard. It was asked to police itself. The prompt told the model to check its own output for banned summary phrases, for scope qualifiers that weren’t in the source, and for invented “X years” durations, then fix them. That self-audit ran in thinking tokens, which means I was paying for it, waiting for it, and getting no guarantee it actually happened. A system prompt is a suggestion the model usually follows and sometimes ignores, most reliably right when it matters.

So the audit moves into deterministic post-processing. The same checks run as plain code against the model’s output, with a single targeted corrective retry when a check fails. The shape here follows standard guidance for production LLM output: run validation after generation and feed any violation back as a narrow correction rather than trusting the model to have caught it (Arthur: Best practices for agent guardrails). Keeping those post-generation checks deterministic is my own choice; the ones here are simple string and regex tests with no judgment call, so there’s no reason to spend a model on them. This block and the ones after it assume the client = Anthropic() and the helpers from the earlier blocks are already in scope.

import re

BANNED = ("results-driven", "team player", "proven track record", "synergy")

# Scope qualifiers the model likes to invent. Allowed only if the source says so.
SCOPE_WORDS = ("enterprise-wide", "company-wide", "global", "industry-leading")

def find_violations(bullet: str, source: str) -> list[str]:
    """The checks the prompt used to run 'out loud' in thinking tokens,
    now deterministic code that runs after generation."""
    low, src = bullet.lower(), source.lower()
    problems: list[str] = []
    problems += [f"banned phrase: {p}" for p in BANNED if p in low]
    problems += [f"unsupported scope: {w}" for w in SCOPE_WORDS
                 if w in low and w not in src]
    # Reject fabricated "X years" durations the source can't support. Match on
    # word boundaries: a bare "5 year" substring lives inside "15 years", so a
    # fabricated "5 years" would sail past a naive `in` check against it. This
    # covers numeric durations only: "5 years", "5-year", "5+ yrs". Spelled-out
    # forms ("five years") are out of scope.
    dur = r"\b(\d+)\+?[\s-]*(?:years?|yrs?)\b"
    for n in re.findall(dur, low):
        if not re.search(rf"\b{n}\+?[\s-]*(?:years?|yrs?)\b", src):
            problems.append(f"unsupported duration: {n} years")
    return problems

def repair(bullet: str, problems: list[str], prompt: str) -> str:
    """One targeted correction, not a fresh full run. Thinking stays off."""
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        thinking={"type": "disabled"},
        system="Rewrite the bullet to fix the listed problems. Return ONLY the bullet text.",
        messages=[{"role": "user", "content":
                   f"Original task:\n{prompt}\n\nBullet:\n{bullet}\n\n"
                   "Fix these problems:\n- " + "\n- ".join(problems)}],
    )
    return "".join(b.text for b in resp.content if b.type == "text").strip()

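Same guardrails, without the thinking-token latency on every call; the only added cost is one short repair call when a check actually fails. And the checks are now guaranteed to run rather than hoped for.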
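The word-boundary detail in the duration check is easy to get wrong, so here is the same regex isolated into a standalone function with the cases that motivated it. `unsupported_durations` is a hypothetical name for this sketch, and the sample source text is made up for illustration.

```python
import re

# Numeric durations only: "5 years", "5-year", "5+ yrs".
DURATION = r"\b(\d+)\+?[\s-]*(?:years?|yrs?)\b"

def unsupported_durations(bullet: str, source: str) -> list[str]:
    """Return each number of years claimed in the bullet that the source lacks."""
    low, src = bullet.lower(), source.lower()
    missing: list[str] = []
    for n in re.findall(DURATION, low):
        if not re.search(rf"\b{n}\+?[\s-]*(?:years?|yrs?)\b", src):
            missing.append(n)
    return missing

source = "15 years of backend experience; led a 3-year migration."

# A naive substring test is fooled: "5 year" sits inside "15 years".
print("5 year" in source)
# The boundary-aware check is not.
print(unsupported_durations("5 years building APIs", source))
# Different spellings of a supported duration still pass.
print(unsupported_durations("Led a 3 year migration", source))
print(unsupported_durations("15+ yrs of backend work", source))
```

The first bullet is flagged because no standalone "5 years" appears in the source, while the reworded "3 year" and "15+ yrs" claims pass because the same numbers are present there.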

Wire it together: extract, validate, retry

The call site is small because each piece does one thing. Extract the bullets with thinking off, validate each against the source text, and re-prompt only the bullets that fail. One corrective retry is the budget; a bullet that still fails keeps its corrected text but gets flagged so a human can look.

import logging

log = logging.getLogger(__name__)

def tailor(prompt: str, source: str) -> list[str]:
    """extract -> validate -> one corrective retry per bad bullet."""
    clean: list[str] = []
    for bullet in extract_bullets(prompt):
        problems = find_violations(bullet, source)
        if problems:
            bullet = repair(bullet, problems, prompt)
            problems = find_violations(bullet, source)   # re-check the repair
            if problems:
                log.warning("bullet still failing after retry: %s", problems)
        clean.append(bullet)
    return clean

The source text is the ground truth the validator measures against, which is why it threads all the way through; a “5 years” claim is only legitimate if those words appear in what the person actually wrote.
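If the flag needs to reach a reviewer rather than just the log, one option is to return the leftover problems alongside each bullet. The sketch below is a variation on the loop above, not the skill's code: `tailor_flagged` is a hypothetical name, and the check and fix functions are passed in so the control flow can be tested with stand-ins instead of model calls.

```python
from typing import Callable

Check = Callable[[str, str], list[str]]   # (bullet, source) -> problems
Fix = Callable[[str, list[str]], str]     # (bullet, problems) -> corrected bullet

def tailor_flagged(bullets: list[str], source: str,
                   check: Check, fix: Fix) -> list[tuple[str, list[str]]]:
    """Validate each bullet, allow one corrective retry, keep what still fails."""
    out: list[tuple[str, list[str]]] = []
    for bullet in bullets:
        problems = check(bullet, source)
        if problems:
            bullet = fix(bullet, problems)      # exactly one retry
            problems = check(bullet, source)    # re-check the repair
        out.append((bullet, problems))          # empty list means clean
    return out

# Stand-ins for find_violations and the model-backed repair call.
def demo_check(bullet: str, source: str) -> list[str]:
    return ["banned phrase: synergy"] if "synergy" in bullet.lower() else []

def demo_fix(bullet: str, problems: list[str]) -> str:
    # Simulates a repair that works for most bullets but not a stubborn one.
    if bullet.startswith("STUBBORN"):
        return bullet
    return bullet.replace("synergy", "collaboration")

result = tailor_flagged(
    ["Shipped the API", "Drove synergy across teams", "STUBBORN synergy"],
    source="", check=demo_check, fix=demo_fix,
)
needs_review = [bullet for bullet, problems in result if problems]
print(needs_review)
```

Passing the checker and fixer in as arguments is a testing convenience; in the real pipeline they would be `find_violations` and a wrapper around `repair`.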

Verify with an eval sweep

It would have been easy to stop at “thinking off felt faster” and ship it. That’s a vibe, not a measurement, and vibes don’t catch the case where a speedup quietly costs you quality. An eval is just that measurement made repeatable: give the system an input, grade the output, and let the grade decide (Anthropic: Demystifying evals for AI agents). So I swept the thinking budget across 0, 4,000, and 8,000 tokens and scored each result.

import time

def score(bullets: list[str], source: str) -> float:
    """Replace with your real rubric. Higher is better.
    Stub: penalize each deterministic violation, so fewer violations wins."""
    return -sum(len(find_violations(b, source)) for b in bullets)

def sweep(prompt: str, source: str, budgets=(0, 4000, 8000)) -> None:
    for budget in budgets:
        thinking = ({"type": "disabled"} if budget == 0
                    else {"type": "enabled", "budget_tokens": budget})
        start = time.monotonic()
        resp = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=16000,            # must exceed budget_tokens
            thinking=thinking,
            system=("Rewrite each résumé bullet for the target job. "
                    "Return ONLY a JSON array of strings."),
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed = time.monotonic() - start
        bullets = _parse_json_array("".join(b.text for b in resp.content if b.type == "text"))
        print(f"budget={budget:>5}  score={score(bullets, source):>4}  "
              f"tokens={resp.usage.output_tokens:>6}  {elapsed:5.1f}s")

The scorer here is a stub on purpose; the real rubric is yours. The point is that the harness, not your gut, picks the winner. The result was clean: zero thinking was both the fastest and the highest-quality setting, and the larger budgets produced zero wins. More thinking made the output slower and worse on this task. On a real 26-bullet résumé the net came out to roughly 13x faster. Picking a configuration on evidence rather than feel is what let me turn off a marquee feature with confidence instead of nerves (Anthropic: Building effective agents). The sweep also caught something I’d never have seen by eye: the previous default, Sonnet on the uncached subscription path, timed out three times out of three, so the CLI now defaults to Haiku.
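Letting the harness pick the winner can be as small as a sort key. The sketch below assumes each sweep row is collected into a record instead of printed; `Run`, `pick_winner`, and `speedup` are hypothetical names, and the sample rows are placeholder values for illustration only, not the measurements from this post. The one grounded figure is the 130-second and 10-second pair quoted earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    budget: int      # thinking budget in tokens (0 = disabled)
    score: float     # higher is better
    seconds: float   # wall-clock latency

def pick_winner(runs: list[Run]) -> Run:
    """Highest score wins; ties go to the faster run."""
    return max(runs, key=lambda r: (r.score, -r.seconds))

def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the new timing is than the old one."""
    return before_s / after_s

# PLACEHOLDER rows, invented to exercise the function.
runs = [
    Run(budget=0, score=0.0, seconds=10.0),
    Run(budget=4000, score=-1.0, seconds=60.0),
    Run(budget=8000, score=-1.0, seconds=120.0),
]
best = pick_winner(runs)
print(f"winner: budget={best.budget}")

# The per-call timings quoted earlier: about 130 s with thinking, about 10 s without.
print(f"speedup: {speedup(130, 10):.0f}x")
```

Ranking by score first and latency second encodes the rule that a speedup never gets to buy a quality regression.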

One cleanup worth calling out: don’t bill a stray key

The skill is meant to run free on your subscription. But if you happen to have ANTHROPIC_API_KEY set in your environment, the CLI uses that key instead of your subscription, and every token gets billed to that API account. That precedence is documented behavior: when the variable is set, Claude Code does not use the subscription account at all (Claude Code: Manage API key environment variables). For a tool whose entire pitch is “no per-run cost,” silently charging a stray key is the opposite of what it promises, so the skill guards the path explicitly rather than inheriting whatever happens to be in the environment.

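# The skill runs free on your subscription. A stray ANTHROPIC_API_KEY would
# silently override that and bill the API account, so unset it before the call.
# unset is a no-op when the variable is not set, so no guard is needed, and the
# line cannot leave a non-zero exit status behind when the variable is absent.
unset ANTHROPIC_API_KEY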
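When the CLI is launched from a Python script rather than a shell, the same guard can be applied to the child process's environment only. This is a sketch under assumptions: `subscription_env` and `run_claude` are hypothetical helpers, and the arguments passed to `claude` are left to the caller.

```python
import os
import subprocess
from typing import Mapping, Optional

def subscription_env(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return a copy of the environment with ANTHROPIC_API_KEY removed.

    The caller's environment is left untouched; only the child process
    started with this mapping loses the key.
    """
    env = dict(os.environ if base is None else base)
    env.pop("ANTHROPIC_API_KEY", None)
    return env

def run_claude(args: list[str]) -> subprocess.CompletedProcess:
    """Run the claude CLI with the scrubbed environment.

    Not called in this sketch: it needs the CLI installed and logged in.
    """
    return subprocess.run(["claude", *args], env=subscription_env(),
                          capture_output=True, text=True, check=True)

# Offline demonstration with a fake environment.
fake = {"ANTHROPIC_API_KEY": "sk-test", "PATH": "/usr/bin"}
print(sorted(subscription_env(fake)))
```

Scrubbing a copy rather than mutating `os.environ` keeps the key available to anything else in the same process that legitimately needs it.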

Next up, the plan is to push tailoring under a minute by trimming the per-bullet before/after payload, which is currently heavier than it needs to be, and to add an eval-gated quality check so that a regression in output quality fails the build the same way a broken test would.

Sources

Anthropic: Extended thinking — use thinking for multi-step reasoning; for extraction and reformatting, respond directly.
Anthropic: Adaptive thinking — “when in doubt, respond directly”; the 4.6+ adaptive models make this call per turn, where Haiku 4.5 needs you to set it by hand.
Anthropic: Building effective agents — start simple, measure, add complexity only when a simpler version falls short.
Anthropic: Demystifying evals for AI agents — an eval gives an input, grades the output, and lets the grade decide.
Arthur: Best practices for agent guardrails — validate after generation and re-prompt on violations instead of trusting self-checks (keeping those checks deterministic is my own design choice).
Claude Code: Manage API key environment variables — a set ANTHROPIC_API_KEY overrides the subscription and bills the API account.

Changelog

[onetapskill] feat: streaming UX, thinking-off speedup, style picker; lean dead-code sweep (48b7f9a)
[onetapskill] ci: auto-tag semver release on merge to master (a75b318)
[onetapskill] docs: license, packaging metadata, README for public consumption (e3643fe)
[onetapskill] release: 0.1.1 — license + packaging for public consumption (03f2474)