Selectable coach tones, and making the model's dials actually do something

Shipped

The daily brief used to have one voice: supportive when things were going well, a roast when I was slipping. That blend was mine and baked into the prompt. v0.10.0 makes it a profile you choose. There are four: supportive (your biggest believer), neutral (the emotion removed), hardass (never satisfied), and adaptive (the old behavior, kept as the default so a fresh clone is unchanged). Each is a real file you can read and edit, with a written voice and a set of numbers, and you switch with a config command.

The personas were the easy half. The two things that actually took thought were making the numbers mean something and proving the tones behave. Both generalize past a fitness app, so here is the end-to-end build: profiles as files, a prompt builder that turns one of those numbers into a real code switch, the call site that uses it, and the test that keeps it honest.

A profile is a file, not a personality

The first decision is where a tone lives. Hard-coding four prompt strings in a match statement works until you want to read them side by side or tweak one without redeploying. So each tone is a flat config file: a block of prose that sets the voice, and a few numbers that the code reads. Keeping this kind of static app data in declarative files instead of code is the old separate-data-from-code instinct; it rhymes with twelve-factor config, though strictly that factor targets settings that vary between deploys, and these presets are the same in every deploy (The Twelve-Factor App: Config). A reader can open hardass.toml and see exactly what that coach is, prose and dials together.

# tones/hardass.toml
name = "hardass"
voice = """
You are a demanding strength coach who is never quite satisfied.
Acknowledge effort in one line, then push for the next level.
Translate jargon, keep it short, never pad with praise.
"""
harsh = true          # read by code, not just the model
warmth = 2            # 1-10, lives in the prose only
target_pressure = 9   # 1-10, lives in the prose only

The other three are the same shape. adaptive is the default, so it has to exist for a fresh clone to boot, and supportive is the soft counterweight the tests lean on, so both are real files too. neutral is the same pattern with the warmth dialed flat:

# tones/adaptive.toml: the default, the old blended behavior
name = "adaptive"
voice = """
You read the week. Celebrate real wins warmly, but when the user is
slipping against a goal, name the gap and push for the next level.
Translate jargon, keep it short.
"""
harsh = true
warmth = 6
target_pressure = 6

# tones/supportive.toml
name = "supportive"
voice = """
You are the user's biggest believer. Lead with what went well,
frame every gap as the next small step, never scold.
Translate jargon, keep it short.
"""
harsh = false
warmth = 9
target_pressure = 3

# tones/neutral.toml
name = "neutral"
voice = """
You are a flat, factual coach. State what the data shows and the next
action, with no encouragement and no scolding.
Translate jargon, keep it short.
"""
harsh = false
warmth = 1
target_pressure = 5

Notice the split that the rest of the post turns on. warmth and target_pressure are numbers I hand to the model inside the prose, and the model interprets them as best it can. harsh is a boolean my code reads directly. They look like peers in the file; they are not peers in how much I can trust them, and the next section is why.

Load the profile into something typed

Before any of that is useful, the files need to become a typed object the rest of the program can pass around. A small loader reads the TOML, validates that the fields exist, and returns a frozen dataclass. Doing the validation once at load time means a malformed tone fails loudly as soon as the profile is loaded, before any prompt is built, rather than silently producing a weird brief at 6am. This needs Python 3.11+, where tomllib is in the standard library (on older versions the tomli backport is a drop-in, and the int | None annotations below want 3.10+).

# tones/profile.py
import tomllib
from dataclasses import dataclass
from pathlib import Path

TONES_DIR = Path(__file__).parent

@dataclass(frozen=True)
class Profile:
    name: str
    voice: str          # the prose the model reads
    harsh: bool         # the dial the code reads
    warmth: int
    target_pressure: int

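REQUIRED_FIELDS = ("name", "voice", "harsh", "warmth", "target_pressure")

def load_profile(name: str) -> Profile:
    path = TONES_DIR / f"{name}.toml"
    if not path.exists():
        raise ValueError(f"unknown tone {name!r}; have {available_tones()}")
    # TOML is UTF-8 by spec, so don't inherit the platform's default encoding
    data = tomllib.loads(path.read_text(encoding="utf-8"))
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"{path.name} is missing {missing}")
    if not isinstance(data["harsh"], bool):
        # bool("false") is True, so reject a quoted flag instead of coercing it
        raise ValueError(f"{path.name}: harsh must be true or false, unquoted")
    return Profile(
        name=data["name"],
        voice=data["voice"].strip(),
        harsh=data["harsh"],
        warmth=int(data["warmth"]),
        target_pressure=int(data["target_pressure"]),
    )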

def available_tones() -> list[str]:
    return sorted(p.stem for p in TONES_DIR.glob("*.toml"))

Now load_profile("hardass") gives back a Profile with a real harsh flag, and an unknown tone name is an error with a helpful message instead of a FileNotFoundError three layers down.
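
The profile files document warmth and target_pressure as 1-10, but the loader only checks that they are integers. One way to enforce the range is a check on the dataclass itself. This is a minimal sketch, not the app's actual code: CheckedProfile is a hypothetical name, and it relies only on the fact that frozen dataclasses still run __post_init__.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckedProfile:
    """Hypothetical variant of Profile that rejects out-of-range dials."""
    name: str
    voice: str
    harsh: bool
    warmth: int
    target_pressure: int

    def __post_init__(self) -> None:
        # reading attributes is allowed on a frozen instance; only assignment is blocked
        for field_name in ("warmth", "target_pressure"):
            value = getattr(self, field_name)
            if not isinstance(value, int) or not 1 <= value <= 10:
                raise ValueError(
                    f"{self.name}: {field_name} must be an integer 1-10, got {value!r}"
                )
```

With this in place a tone file that says warmth = 11 would fail at construction with a message naming the tone and the field, instead of sending "11/10" to the model.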

A number next to a prompt is a hope, not a control

The tempting design is to write a persona with prose and a dial, “harshness: 9,” and assume the model reads the dial and acts on it. It might. It might also glance at a 9 sitting next to a paragraph and do its own thing. A knob the system can’t falsify is just decoration you’ll end up trusting anyway.

So for the parts of the brief with a concrete goal, like step count and plan adherence, the profile decides in code whether the harsh instruction block goes into the prompt at all. A harsh profile assembles that block, a soft profile leaves it out, and the prose voice handles the rest. The model never sees a number it has to honor for this behavior; it sees text that is either present or absent.

# brief/prompt.py
from dataclasses import dataclass
from tones.profile import Profile

HARSH_BLOCK = (
    "Hold the user to their goal. If steps or plan-adherence fall short of "
    "target, say so directly and push for more. Do not soften a miss."
)

@dataclass
class Goals:
    step_target: int | None = None
    plan_id: str | None = None

    def has_concrete_targets(self) -> bool:
        """True only when there is a measurable goal to be harsh about."""
        return bool(self.step_target or self.plan_id)

def build_brief_prompt(base: str, profile: Profile, goals: Goals) -> str:
    prompt = f"{base}\n\n{profile.voice}"
    # the soft dials go into the prose for the model to interpret as best it can
    prompt += (
        f"\n\nCalibrate warmth to {profile.warmth}/10 and "
        f"target pressure to {profile.target_pressure}/10."
    )
    if profile.harsh and goals.has_concrete_targets():
        prompt += "\n\n" + HARSH_BLOCK   # a switch in code, not a hope about the model
    return prompt

That is the distinction Anthropic draws between workflows and agents: a workflow orchestrates the model through predefined code paths when you want predictability, rather than letting the model decide (Anthropic: Building effective agents). The reliable lever isn’t a number I hope the model honors, it’s what I deterministically choose to put in front of it, which is the whole idea behind treating context as something you engineer rather than dump (Anthropic: Effective context engineering for AI agents). The dial now flips a real switch in code, and the harsh block only appears when there is a concrete target to be harsh about.

The rule I took from it: if a configuration value can’t change something you can observe and assert, it isn’t configuration, it’s a comment.
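
That rule can be turned into a check of its own: change one field at a time and see whether the assembled prompt moves. The sketch below is self-contained and illustrative only. Profile and build are condensed stand-ins for the real dataclass and build_brief_prompt (assuming a concrete target exists), and inert_fields is a hypothetical helper, not something from the app.

```python
from dataclasses import dataclass, replace
from typing import Callable

@dataclass(frozen=True)
class Profile:
    voice: str
    harsh: bool
    warmth: int
    target_pressure: int

HARSH_BLOCK = "Hold the user to their goal. Do not soften a miss."

def build(profile: Profile) -> str:
    # condensed stand-in for build_brief_prompt, assuming a concrete target exists
    prompt = (
        f"{profile.voice}\n\nCalibrate warmth to {profile.warmth}/10 and "
        f"target pressure to {profile.target_pressure}/10."
    )
    if profile.harsh:
        prompt += "\n\n" + HARSH_BLOCK
    return prompt

def inert_fields(builder: Callable[[Profile], str], base: Profile, variants: dict) -> list[str]:
    """Fields whose new value leaves the assembled prompt unchanged."""
    baseline = builder(base)
    return [
        name for name, value in variants.items()
        if builder(replace(base, **{name: value})) == baseline
    ]

base = Profile(voice="Be direct.", harsh=False, warmth=5, target_pressure=5)
variants = {"voice": "Be gentle.", "harsh": True, "warmth": 9, "target_pressure": 1}
print(inert_fields(build, base, variants))  # prints [] because every field moves the prompt
```

This only proves that each value reaches the prompt. For warmth and target_pressure the observable change is a number in the text, which the model may or may not honor; for harsh it is an entire instruction block being present or absent.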

Assemble the brief for a chosen profile

The call site is where the pieces meet. A config command writes the chosen tone name to settings, the brief loop reads it back, loads that profile, and builds the prompt around the day’s snapshot. The brief loop never branches on tone itself; it just loads whatever profile is configured and hands the assembled prompt to the model.

# brief/run.py
from anthropic import Anthropic
from tones.profile import load_profile
from brief.prompt import build_brief_prompt
from fitness import db, settings

def _complete(prompt: str) -> str:
    client = Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def run_daily_brief(day: str | None = None) -> str:
    profile = load_profile(settings.get("tone", default="adaptive"))
    snapshot = db.snapshot_for(day)
    goals = db.goals_for(day)

    base = f"Write today's training brief from this snapshot:\n{snapshot}"
    prompt = build_brief_prompt(base, profile, goals)
    return _complete(prompt)

# switching tone is a one-line config change, no redeploy:
#   $ fitness config set tone hardass

Here db, settings, and the fitness config command are this app’s own data and settings layer; any key/value store you already have works in their place. Same loop, same snapshot, four possible voices. Whether the harsh block rides along (today only hardass and adaptive set harsh = true, and only when there is a concrete target) is decided before the request ever leaves the machine.

Test the tone, don’t eyeball it

The other half was testing the way I keep saying I want to: against expectations instead of by reading one nice sample. The deterministic part is easy and exact: assert that the harsh block is present for hardass with a real target, absent for supportive, and absent for hardass when no target is set. Those tests fail the moment someone renames the flag, inverts the condition, or drops the target check. The fakes they use are one-liners that build a Goals with and without a step_target, so they satisfy the same has_concrete_targets() contract the real object does:

# tests/fakes.py
from brief.prompt import Goals

def goals_with_target() -> Goals:
    return Goals(step_target=8000)

def goals_without_target() -> Goals:
    return Goals()

# tests/test_brief_prompt.py
from tones.profile import load_profile
from brief.prompt import build_brief_prompt, HARSH_BLOCK
from tests.fakes import goals_with_target, goals_without_target

BASE = "Write today's training brief."

def test_harsh_block_present_for_hardass():
    prompt = build_brief_prompt(BASE, load_profile("hardass"), goals_with_target())
    assert HARSH_BLOCK in prompt

def test_harsh_block_absent_for_supportive():
    prompt = build_brief_prompt(BASE, load_profile("supportive"), goals_with_target())
    assert HARSH_BLOCK not in prompt

def test_harsh_block_skipped_without_a_concrete_target():
    # even a harsh profile has nothing to be harsh about with no goal
    prompt = build_brief_prompt(BASE, load_profile("hardass"), goals_without_target())
    assert HARSH_BLOCK not in prompt

The model’s actual words are the part you can’t assert exactly, and that is where people give up and go back to eyeballing. The move is to assert the properties instead. Anthropic’s own guidance is to define evals around the things that matter rather than expecting deterministic output (Anthropic: Demystifying evals for AI agents). So a scorer runs every available profile against a fixed snapshot and checks the properties each output must hold, then A/B-checks the tones against expectations rather than against my gut.

The wrapper that makes that possible lives in a test-only module and imports the production pieces, never the other way around. That direction matters: brief/run.py has no idea the harness exists, so importing it at 6am to write the real brief never drags in pytest or the tests package. generate_for_test builds the same prompt against one fixed snapshot and returns a small parsed object, and tone_category() is a second model call grading the first, an LLM judge reading the output and naming the coach who must have written it.

# tests/harness.py
from dataclasses import dataclass
from tones.profile import load_profile
from brief.prompt import build_brief_prompt
from brief.run import _complete
from tests.fakes import goals_with_target

FIXED_SNAPSHOT = "steps 4200/8000, zone-2 12 min, HRV 41ms, push-day skipped"

@dataclass
class BriefOutput:
    tone: str
    text: str

    def matches_schema(self) -> bool:        # same shape regardless of voice
        return bool(self.text.strip()) and "\n" in self.text

    def translated_jargon(self) -> bool:     # best-effort signal, not a hard gate
        return all(term not in self.text for term in ("HRV", "zone-2"))

    def directive_count(self) -> int:        # proxy for how hard it pushes
        starts = ("Do ", "Push", "Get ", "Hit ")
        return sum(line.strip().startswith(starts) for line in self.text.splitlines())

    def tone_category(self) -> str:          # the LLM judge, below
        return judge_tone(self.text)

def generate_for_test(tone: str) -> BriefOutput:
    profile = load_profile(tone)
    base = f"Write today's training brief from this snapshot:\n{FIXED_SNAPSHOT}"
    prompt = build_brief_prompt(base, profile, goals_with_target())
    return BriefOutput(tone=tone, text=_complete(prompt))

def judge_tone(text: str) -> str:
    # one judge call: hand the model the brief, get back a single category word
    verdict = _complete(
        "Reply with exactly one word from "
        "[supportive, neutral, hardass, adaptive] naming this brief's tone:\n\n" + text
    )
    return verdict.strip().lower()
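
A judge model does not always reply with a bare word; "Hardass." or "The tone is supportive" are plausible replies, and comparing two such raw strings would make a != assertion pass for the wrong reason. A minimal sketch of a stricter normalizer follows. ALLOWED and normalize_verdict are hypothetical names, not part of the harness above.

```python
import re

ALLOWED = ("supportive", "neutral", "hardass", "adaptive")

def normalize_verdict(raw: str) -> str:
    """Map a free-form judge reply onto one allowed category, or 'unknown'."""
    words = re.findall(r"[a-z]+", raw.lower())
    hits = [word for word in words if word in ALLOWED]
    # accept only an unambiguous reply: exactly one distinct category named
    return hits[0] if len(set(hits)) == 1 else "unknown"
```

If judge_tone returned normalize_verdict(verdict), a contrast test could also assert that both verdicts are in ALLOWED before comparing them, so an unparseable reply shows up as a failure rather than as an accidental difference.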

Every method on BriefOutput except tone_category() is a cheap local check; tone_category() and generate_for_test() are the parts that actually call the model, which is why the property tests below are gated. They hit the live API: a real key, real cost, and output that drifts run to run. Gating on the key’s presence would backfire, since anyone working on this app already has ANTHROPIC_API_KEY exported for run.py, so a plain pytest would charge them on every commit. The gate is an explicit opt-in instead: they run only when you set RUN_LLM_TESTS=1, so a plain pytest run and CI skip them, and they run on their own cadence (or against a mock) rather than in the commit loop.

# tests/test_tone_properties.py
import os
import pytest
from tones.profile import available_tones
from tests.harness import generate_for_test

requires_llm = pytest.mark.skipif(
    # opt in with exactly RUN_LLM_TESTS=1; a truthiness check would also run on RUN_LLM_TESTS=0
    os.environ.get("RUN_LLM_TESTS") != "1",
    reason="set RUN_LLM_TESTS=1 to run the paid LLM property tests",
)

@requires_llm
@pytest.mark.parametrize("tone", available_tones())
def test_every_tone_returns_a_well_formed_brief(tone):
    # the structural contract holds for every voice, exactly
    out = generate_for_test(tone)
    assert out.matches_schema()

@requires_llm
def test_hardass_pushes_harder_than_supportive():
    # discriminating property: assert a contrast, not an exact label.
    # don't ask one judge call to nail four overlapping categories; the two
    # tones meant to be furthest apart should diverge on a measurable axis.
    hard = generate_for_test("hardass")
    soft = generate_for_test("supportive")
    assert hard.directive_count() > soft.directive_count()
    assert hard.tone_category() != soft.tone_category()

tone_category() is asserted only as a contrast between an opposed pair, never as exact equality per tone, because adaptive and hardass assemble nearly identical prompts and neutral and supportive overlap. Demanding that the judge return the exact word for all four would flake without proving anything. The directive count is the real discriminator; the judge call just confirms the two ends read as different coaches. The jargon check rides along as a logged signal rather than a gate, since “no raw HRV or zone term ever survives” is a hope about the model, not a contract it owes me.

The non-deterministic part gets contrast properties, the deterministic part gets exact ones, and the tone stops being a vibe I personally sign off on. The split also decides when each runs: the deterministic prompt tests are free and fast, so they run on every commit, while the property tests stay behind the RUN_LLM_TESTS opt-in.
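
The local checks can also be exercised with no model call at all, which is one reading of running "against a mock". The sketch below is an assumption-heavy illustration: the canned briefs are invented text, fake_complete and jargon_left are hypothetical helpers, and directive_count simply mirrors the proxy from the harness. It also shows the jargon check being logged as a warning instead of asserted.

```python
import logging

log = logging.getLogger("tone_checks")

# Invented canned briefs standing in for model output; no API call is made.
CANNED = {
    "hardass": "You hit 4200 of 8000 steps.\nHit 8000 tomorrow.\nDo the push day you skipped.",
    "supportive": "You got 4200 steps in, a solid base.\nA short walk after dinner closes the gap.",
}

def fake_complete(tone: str) -> str:
    return CANNED[tone]

def directive_count(text: str) -> int:
    # same proxy as BriefOutput.directive_count: lines that open with an imperative
    starts = ("Do ", "Push", "Get ", "Hit ")
    return sum(line.strip().startswith(starts) for line in text.splitlines())

def jargon_left(text: str) -> list[str]:
    return [term for term in ("HRV", "zone-2") if term in text]

def check_contrast() -> bool:
    hard, soft = fake_complete("hardass"), fake_complete("supportive")
    for tone, text in (("hardass", hard), ("supportive", soft)):
        leftovers = jargon_left(text)
        if leftovers:
            # a signal to review, not a test failure
            log.warning("%s brief kept raw jargon: %s", tone, leftovers)
    return directive_count(hard) > directive_count(soft)
```

A mock like this verifies the scoring code and the wiring, not the model. It is cheap enough for the commit loop, but it says nothing about whether a live hardass brief actually pushes harder, which is still the job of the gated tests.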

The profiles are fixed presets today. The natural next step is letting the saved coaching notes nudge a chosen profile on a specific point, since the precedence for that is already written into the prompt.

Sources

Anthropic: Building effective agents — workflows orchestrate models through predefined code paths for predictability.
Anthropic: Effective context engineering for AI agents — control what goes into the model’s context deliberately.
Anthropic: Demystifying evals for AI agents — evaluate non-deterministic systems on the properties that matter.
The Twelve-Factor App: Config — keep configuration that varies between deployments in declarative data outside the code.

Changelog

feat: selectable coach tone profiles for the daily brief (0.10.0) (e53b005)
docs: per-profile A/B + quality scorer vs expected outcomes (mandatory, not optional) (fc17711)
docs: design for selectable coach tone profiles (supportive/neutral/hardass) (f15df80)