The brief got ~2.5-3x faster, and the fancy idea lost to a measurement
Shipped
The daily brief took about four minutes to write, and that is a number you feel every morning. v0.6.0 takes it to roughly 80 seconds with equal-or-better quality, about 2.5-3x faster. The change was two small moves: dial the model’s reasoning effort down for this output-bound task, and move table formatting out of the model and into code. The part worth writing about is how I picked those moves, because almost every step contradicted what I expected. Here is the end-to-end build: profile where the time goes, set the effort, render tables deterministically, generate a brief through the faster path, and verify the quality with a blind A/B.
Profile before you optimize
The obvious fix looked like parallelism: split the one big generation into a fan-out and compose the cards concurrently. I designed it, ran the design through a quality gate, and then measured before building anything. The measurement killed it. Concurrent calls only ran 1.44x faster at three-wide, under a bar I had set in advance, so fan-out would have added a pile of moving parts to buy almost nothing. This is Knuth’s old warning with a fresh coat of paint: profile first, because the small efficiencies are not where the time is (Program optimization). It rhymes with Anthropic’s advice for agent systems too, find the simplest thing that works and only add complexity when it demonstrably earns its keep (Anthropic: Building effective agents).
So instead of building the fan-out, I instrumented a real brief to see where the wall-clock actually went. A phase timer around each step, plus the token usage off the response, is enough to find the bottleneck. This assumes the Anthropic SDK is installed (pip install anthropic) with ANTHROPIC_API_KEY set in your environment.
import time
from contextlib import contextmanager
from fitness import db
from fitness.brief import generate_brief, brief_text # the Claude call (built below)
class PhaseTimer:
"""Accumulate wall-clock per named phase so you can see the split."""
def __init__(self):
self.phases: dict[str, float] = {}
@contextmanager
def phase(self, name: str):
start = time.perf_counter()
try:
yield
finally:
self.phases[name] = self.phases.get(name, 0.0) + (time.perf_counter() - start)
def report(self):
total = sum(self.phases.values()) or 1.0
for name, secs in sorted(self.phases.items(), key=lambda kv: -kv[1]):
print(f"{name:16} {secs:7.2f}s {secs / total:5.1%}")
timer = PhaseTimer()
with timer.phase("db_snapshot"):
snapshot = db.snapshot_for(None) # most recent day
with timer.phase("model_generate"):
message = generate_brief(snapshot) # one Claude call
timer.report()
# Reasoning is billed in output tokens but never shown to the reader. Compare the
# total output against the visible brief to see how much was hidden thinking.
visible = brief_text(message)
print("output tokens:", message.usage.output_tokens)
print("visible chars:", len(visible))
The result was lopsided. The single model-generation phase was essentially the whole runtime, and most of the output tokens were reasoning the reader never sees. The database tools, the part I’d have bet on, were a rounding error in milliseconds. The bottleneck wasn’t the plumbing, it was the model thinking far more than this task needed.
The lever was reasoning effort, set low
The lever that worked wasn’t the one I reached for. On Opus 4.8 a fixed thinking budget (budget_tokens) is rejected with a 400; thinking is adaptive-only, and depth is controlled by output_config.effort. So the lever is effort, not a token budget. Effort controls how much the model thinks before it answers, and the default is high (Claude: the effort parameter). A daily brief is output-bound: the hard part is writing well, not reasoning deeply, so high effort spends tokens on hidden deliberation that the brief doesn’t benefit from. Dropping it to low cut the brief to about 80 to 95 seconds.
import anthropic
client = anthropic.Anthropic()
COACH_SYSTEM = "You are my training coach. Write a short, specific daily brief."
def generate_brief(snapshot: dict, effort: str = "low") -> anthropic.types.Message:
"""One model call, reasoning effort dialed to LOW for the daily path.
effort defaults to "high"; "low" spends fewer tokens on hidden reasoning and
more on the brief itself. It's a parameter, not a prompt trick, so the change
is one line and is easy to sweep in the A/B below.
max_tokens is set high on purpose: on Opus 4.8 reasoning is billed as output
tokens and counts against this ceiling. At high effort the model can spend
thousands of tokens thinking before it writes a word, so a tight ceiling
would truncate the brief mid-thought (or before it starts) and come back
with stop_reason="max_tokens". The headroom lets the visible brief survive
even a large high-effort thinking pass."""
return client.messages.create(
model="claude-opus-4-8",
max_tokens=16000,
thinking={"type": "adaptive"},
output_config={"effort": effort}, # the lever that moved the needle
system=COACH_SYSTEM,
messages=[{"role": "user", "content": f"Write today's brief:\n{snapshot}"}],
)
def brief_text(message: anthropic.types.Message) -> str:
"""The visible prose, with thinking blocks left out.
Guard against a truncated generation: if the model hit the token ceiling
(stop_reason="max_tokens"), the visible text is partial or empty, so refuse
it here rather than let a half-written brief reach the saved output."""
if message.stop_reason == "max_tokens":
raise RuntimeError("brief truncated at max_tokens; raise the ceiling")
return "".join(b.text for b in message.content if b.type == "text")
That is the counterintuitive result worth keeping, and the A/B below is what made me trust it. More reasoning was not more quality. Past the point the task needed, the extra effort padded the output instead of sharpening it, a small case of perfect being the enemy of good (Perfect is the enemy of good).
Move table rendering into code
Low effort had one tell: the model occasionally dropped a row break and collapsed a markdown table into one unreadable line. The durable fix is to stop hoping the model samples the row breaks right. Formatting that has to be correct every time belongs in code, so the weekly metrics table is built by a small deterministic renderer and inserted into the brief.
def render_table(headers: list[str], rows: list[list[str]]) -> str:
"""Format a markdown table in code, so output never depends on the model
getting the row breaks right."""
line = lambda cells: "| " + " | ".join(str(c) for c in cells) + " |"
sep = "| " + " | ".join("---" for _ in headers) + " |"
return "\n".join([line(headers), sep, *(line(r) for r in rows)])
Generate a brief through the faster path
Now wire it together. The model writes the prose at low effort; the code owns the table and assembles the saved brief. The model never has to render the metrics, so a bad sample can’t corrupt them.
from fitness import db
from fitness.brief import generate_brief, brief_text, render_table
def daily_brief(day: str | None = None) -> str:
snapshot = db.snapshot_for(day)
prose = brief_text(generate_brief(snapshot)) # low effort by default
table = render_table(
["Metric", "Today", "7-day avg"],
[
["Resting HR", snapshot["resting_hr"], snapshot["resting_hr_7d"]],
["HRV (ms)", snapshot["hrv_ms"], snapshot["hrv_ms_7d"]],
["Sleep (h)", snapshot["sleep_hours"], snapshot["sleep_hours_7d"]],
],
)
brief = f"{prose.strip()}\n\n{table}\n"
db.save_brief(day, brief) # repair/normalize at save
return brief
The coach prose now arrives in roughly 80 to 95 seconds, and the metrics table renders clean no matter how the model gets sampled.
Verify with a blind A/B
The whole change rests on one claim: low effort is at least as good as the old default. Faith in a setting is not evidence, so I scored it. A separate judge call, the same model in a fresh context, sees only the brief text and never learns which effort produced it. Running the same snapshots at both efforts and counting wins is enough to confirm or kill the change. Two things keep the comparison honest: the A/B drops any sample where either arm stopped on max_tokens instead of end_turn, so a truncated high-effort brief can’t lose on length alone; and because the judge shares the writer’s model, this is a directional read rather than an oracle, which is why the Next section adds a second judge model.
import json
import anthropic
client = anthropic.Anthropic()
SCORE_SCHEMA = {
"type": "object",
"additionalProperties": False,
"properties": {
# The structured-outputs schema subset has no minimum/maximum, so bound
# each score with an enum instead. That guarantees a 1-5 integer.
"specificity": {"type": "integer", "enum": [1, 2, 3, 4, 5]},
"voice": {"type": "integer", "enum": [1, 2, 3, 4, 5]},
"no_repetition": {"type": "integer", "enum": [1, 2, 3, 4, 5]},
"no_filler": {"type": "integer", "enum": [1, 2, 3, 4, 5]},
},
"required": ["specificity", "voice", "no_repetition", "no_filler"],
}
JUDGE_SYSTEM = "Score the brief 1-5 on specificity, voice, no_repetition, no_filler."
def score_brief(brief: str) -> dict:
"""Blind: the judge sees only the text, not the effort that made it."""
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=200,
output_config={"effort": "low", "format": {"type": "json_schema", "schema": SCORE_SCHEMA}},
system=JUDGE_SYSTEM,
messages=[{"role": "user", "content": brief}],
)
# Filter to text blocks: adaptive thinking can prepend a thinking block,
# so content[0] isn't guaranteed to be the JSON.
text = "".join(b.text for b in resp.content if b.type == "text")
return json.loads(text)
def ab_compare(snapshots: list[dict], efforts=("low", "high")) -> dict[str, int]:
wins = {e: 0 for e in efforts}
for snap in snapshots:
# Generate every arm first so truncated samples can be dropped before
# scoring. A high-effort arm that hit max_tokens during thinking would
# otherwise be scored as a short or empty brief and lose on length, which
# measures truncation, not quality.
messages = {e: generate_brief(snap, effort=e) for e in efforts}
if any(m.stop_reason != "end_turn" for m in messages.values()):
continue # control for truncation: skip the whole snapshot
scored = {e: sum(score_brief(brief_text(messages[e])).values()) for e in efforts}
best = max(scored.values())
leaders = [e for e, s in scored.items() if s == best]
# Ties break toward the earliest effort in the tuple (here "low"): equal
# quality at lower cost is the win I want, and it's the conservative call
# against the shipping bar below.
winner = min(leaders, key=list(efforts).index)
wins[winner] += 1
return wins
# wins["low"] >= wins["high"] is the bar the change had to clear before shipping.
The blind judge scored the low-effort briefs against the old default and the low ones won on every measure: specificity, voice, no repetition, no filler. The only way I’d ever have learned that is by scoring the output instead of assuming more compute is safer.
Next
A wider A/B across more days and a second judge model to lock in the quality read, and an optional pass that assembles the brief’s data in code to trim the last few seconds of tool round-trips.
Sources
- Program optimization (Knuth on premature optimization) — profile before you optimize; small efficiencies aren’t where the time is.
- Anthropic: Building effective agents — start simple, add complexity only when it measurably helps.
- Perfect is the enemy of good — more effort past “good enough” can make the result worse.
- Claude: the effort parameter —
output_config.effortcontrols thinking depth and token spend; default is high, lower means terser, cheaper output.