A prompt is code you can't compile, so score it against the schema it feeds

Shipped

local-fitness pulls my Garmin data into a local SQLite database and lets a Claude agent write a daily training briefing. Until v0.1.0 it was a personal script; this release treats it like software someone else could run. I added a pytest suite over the deterministic core (database helpers, the user-notes store, the schemas, the baselines, the agent’s query tools) running against a seeded temporary database with no network, put it behind a package coverage gate, and wired CI to run linting, tests, and a prompt scorer on every push. A version-driven release workflow cuts a GitHub Release whenever the version in the project file is new, and I rewrote the README from private notes into a shareable OSS project with a license, badges, a privacy section, and cross-platform setup. The first CI run earned its keep by catching a real bug.

The part worth writing about is what it means to test an agent at all. Most of the app is ordinary code you can assert against, but the briefing is produced by a prompt, and a prompt doesn’t compile. So I scored it, and the check I’m happiest with cross-validates the prompt against the output schema it feeds. Here is the end-to-end build: the schema, the scorer, the hermetic test core, the CI wiring, and the failures that prove it works.

A prompt is code you can’t compile

The prompt that tells the agent how to write a briefing is logic, the same way a function is logic. It carries rules (never fabricate a number), translations (turn training-load jargon into plain coaching language), tone, and the user’s saved notes. You can’t run a type checker over English, so the usual safety net is missing exactly where the behavior is most load-bearing. The answer isn’t to give up on rigor; it’s to grade the prompt against grounded pass/fail checks, the same discipline you’d apply to any other untested surface.

This is what Anthropic calls evals, and their framing maps cleanly onto a coaching agent: groundedness checks that claims trace back to real data, coverage checks for the facts a good answer must include, and graders that turn a fuzzy “is this good” into concrete outcomes (Anthropic: Demystifying evals for AI agents). My scorer is small and opinionated. It confirms the prompt keeps the never-fabricate rule, that it still translates the jargon, that the coaching tone survives an edit, and that saved user notes are actually wired in. None of those are unit tests in the classic sense, but each one is a real assertion about a part of the app I’d otherwise only check by reading it and hoping. The strongest of those assertions ties the prompt to something machine-readable, so start there.

Setup: make the output schema the single source of truth

The briefing’s output is validated against a set of Pydantic enums: the allowed metrics and the allowed tones, the vocabulary the rest of the app understands. Pydantic enforces that a value is a real member of the enum or it doesn’t pass (Pydantic: Standard Library Types). Defining these enums once gives every later step a contract to check against, instead of each layer carrying its own private list of “valid” strings.

# fitness/schema.py
from enum import Enum
from pydantic import BaseModel

class Metric(str, Enum):
    """The only metrics a briefing is allowed to report."""
    RECOVERY = "recovery"
    SLEEP = "sleep"
    TRAINING_LOAD = "training_load"

class Tone(str, Enum):
    """The only voices a briefing is allowed to adopt."""
    ENCOURAGING = "encouraging"
    NEUTRAL = "neutral"
    DIRECT = "direct"

class Briefing(BaseModel):
    """Agent output is parsed into this; an unknown metric or tone raises ValidationError."""
    metrics: list[Metric]
    tone: Tone
    body: str

def allowed_metrics() -> set[str]:
    """Every metric string the rest of the app will accept, from one place."""
    return {m.value for m in Metric}

def allowed_tones() -> set[str]:
    """Every tone string the rest of the app will accept, from one place."""
    return {t.value for t in Tone}

allowed_metrics() and allowed_tones() are the load-bearing exports. Anything downstream that wants to know “is this a real metric” or “is this a real tone” asks the schema, not a hand-maintained constant. Keeping the two categories separate matters: a tone that wandered into the metrics list is still drift, and checking against a single merged set of both enums would let that cross-category mistake slip through.
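To see why the categories are checked separately, here is a small standard-library sketch. The enums mirror the ones above; advertised_metrics is a hypothetical prompt list in which a tone has wandered into the metrics block, and the variable names are illustrative only.

```python
from enum import Enum


class Metric(str, Enum):
    RECOVERY = "recovery"
    SLEEP = "sleep"
    TRAINING_LOAD = "training_load"


class Tone(str, Enum):
    ENCOURAGING = "encouraging"
    NEUTRAL = "neutral"
    DIRECT = "direct"


metrics = {m.value for m in Metric}
tones = {t.value for t in Tone}

# Hypothetical slip: "encouraging" is a tone, but it was listed as a metric.
advertised_metrics = {"recovery", "sleep", "encouraging"}

# Checked against one merged set, the mistake is invisible.
merged_drift = advertised_metrics - (metrics | tones)

# Checked against its own category, it is flagged.
per_category_drift = advertised_metrics - metrics

print(merged_drift)        # set()
print(per_category_drift)  # {'encouraging'}
```

The merged check reports nothing because "encouraging" is a valid string somewhere in the schema; only the per-category check knows it is in the wrong place.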

Build: a scorer that parses the prompt and asserts a subset

The prompt advertises metrics and tones in prose, something like “report recovery, sleep, and training load in an encouraging voice.” That sentence and the enum above are two descriptions of the same contract, and the moment they disagree, briefings break. Because the agent’s output is parsed into the Pydantic model, an off-schema value doesn’t slip through unnoticed; it raises.

from fitness.schema import Briefing

# agent_output is the JSON string the agent produced for one briefing.
# model_validate_json raises ValidationError if any metric or tone is off-schema.
briefing = Briefing.model_validate_json(agent_output)

So drift fails loud, which sounds safe until you notice where it fails: at parse time, in production, the moment a user asks for today’s briefing. The model emits the metric the prompt promised, the schema rejects it, and the request dies with a ValidationError instead of returning anything. That’s a real failure on the user’s side, triggered by an edit to a text file no test was watching.

The fix is to give the contract one source of truth and move that failure earlier. So I keep the prompt’s promises in a delimited block I can parse, pull the tokens out, and assert they’re a subset of the enum members. Drift exits non-zero in CI, so the mismatch is caught on push instead of surfacing as a failed briefing request the next morning (Fail-fast system).

Here is the artifact the scorer reads. The prose rules sit next to machine-readable METRICS:/TONES: blocks, and the {user_notes} placeholder is where saved notes get interpolated at runtime.

# fitness/briefing_prompt.txt
You are the athlete's training coach. Write today's briefing from the data
provided, and follow these rules.

Never fabricate a number; if a metric is missing, say so plainly.
Translate training load into plain coaching language instead of raw jargon.
Keep an encouraging coaching tone throughout.

METRICS:
- recovery
- sleep
- training_load

TONES:
- encouraging
- neutral
- direct

Saved athlete notes to respect:
{user_notes}

# fitness/score_prompt.py
import re
import sys
from fitness.schema import allowed_metrics, allowed_tones

def advertised_tokens(prompt: str) -> dict[str, set[str]]:
    """Pull the tokens the prompt promises from delimited METRICS:/TONES: blocks.

    The prompt keeps machine-readable lists alongside its prose, e.g.

        METRICS:
        - recovery
        - sleep
        TONES:
        - encouraging

    Each heading sits on its own line, followed by one `- token` per line.
    re.MULTILINE anchors ^ and $ at every line, so the blocks can appear
    anywhere in the prompt; the optional leading whitespace in the pattern
    tolerates indentation, so the scorer isn't welded to one exact placement.
    """
    found: dict[str, set[str]] = {"METRICS": set(), "TONES": set()}
    for heading in found:
        block = re.search(
            rf"^[ \t]*{heading}:\s*$\n((?:^\s*-\s+.+$\n?)+)",
            prompt,
            re.MULTILINE,
        )
        if block:
            found[heading] = {
                line.split("-", 1)[1].strip()
                for line in block.group(1).splitlines()
                if line.strip().startswith("-")
            }
    return found

def score(prompt: str) -> list[str]:
    """Return a list of failures. Empty list means the prompt is in sync."""
    failures: list[str] = []
    advertised = advertised_tokens(prompt)

    # The key assertion: everything the prompt promises must exist in the
    # schema, and each category is checked against its own enum so a tone
    # listed as a metric (or vice versa) still counts as drift. Drift here
    # is what crashes briefings at parse time in production.
    for heading, allowed in (
        ("METRICS", allowed_metrics()),
        ("TONES", allowed_tones()),
    ):
        drift = advertised[heading] - allowed
        if drift:
            failures.append(
                f"prompt advertises non-schema {heading.lower()}: {sorted(drift)}"
            )

    # Grounded prose checks, so editing the prompt can't quietly drop the
    # rules the briefing depends on.
    if "never fabricate" not in prompt.lower():
        failures.append("prompt dropped the never-fabricate rule")
    if "training load" not in prompt.lower():
        failures.append("prompt dropped the training-load jargon translation")
    if "coaching tone" not in prompt.lower():
        failures.append("prompt dropped the coaching-tone instruction")
    if "{user_notes}" not in prompt:
        failures.append("prompt no longer wires in saved user notes")
    return failures

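def main() -> int:
    # The path is relative to the repo root, which is where CI runs the scorer.
    with open("fitness/briefing_prompt.txt", encoding="utf-8") as prompt_file:
        failures = score(prompt_file.read())
    for failure in failures:
        print(f"FAIL: {failure}", file=sys.stderr)
    return 1 if failures else 0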

if __name__ == "__main__":
    sys.exit(main())  # non-zero exit -> CI fails loud

Note the direction of the check. I assert the advertised tokens are a subset of the schema, not the reverse. The schema is allowed to know about a metric the prompt hasn’t started using yet; the prompt is never allowed to promise something the schema can’t validate. That asymmetry is the whole point: the machine-readable contract is the ceiling.
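The asymmetry is easy to state with set operators. In this sketch the names schema, prompt_behind, prompt_ahead, and in_sync are illustrative, and "hrv" is a hypothetical metric the schema does not define; none of them come from the project.

```python
# What the schema can validate (the ceiling).
schema = {"recovery", "sleep", "training_load"}

# The prompt has not started using training_load yet: allowed.
prompt_behind = {"recovery", "sleep"}

# The prompt promises a metric the schema has never heard of: drift.
prompt_ahead = {"recovery", "sleep", "hrv"}


def in_sync(advertised: set[str], allowed: set[str]) -> bool:
    """True when everything advertised is something the schema accepts."""
    return advertised <= allowed  # subset test, not equality


print(in_sync(prompt_behind, schema))  # True
print(in_sync(prompt_ahead, schema))   # False
print(sorted(prompt_ahead - schema))   # ['hrv']
```

An equality check would reject prompt_behind as well, which would force every schema addition and its prompt edit into the same commit.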

Build: a hermetic test fixture for the deterministic core

The deterministic core got real tests, but the detail that taught me something was how the first CI run failed. My security tests passed on my machine and blew up on a clean checkout, because they were silently reading my actual local Garmin database. On my laptop that file exists, so the tests looked green; on a fresh runner it doesn’t, so they fell over. Green tests that aren’t actually exercising the code are their own kind of silent failure, a plausible signal hiding a broken state (aipatternbook: Silent Failure). That’s the exact failure hermetic testing is designed to prevent: a test’s result shouldn’t depend on the machine or the person running it, so it brings its own throw-away dependencies and touches no network (Google Testing Blog: Hermetic Servers).

The fixture below creates a fresh SQLite file in a temp directory, seeds it with known rows, and points the app’s database path at it for the duration of the test. Nothing reads my real data, and nothing reaches the network.

# tests/conftest.py
import sqlite3
import pytest

@pytest.fixture
def fitness_db(tmp_path, monkeypatch):
    """A seeded, throw-away database. Created fresh, discarded after, no network."""
    db_path = tmp_path / "fitness.db"
    conn = sqlite3.connect(db_path)
    conn.executescript(
        """
        CREATE TABLE daily (
            day TEXT PRIMARY KEY,
            recovery INTEGER,
            sleep_minutes INTEGER,
            training_load TEXT
        );
        """
    )
    conn.execute(
        "INSERT INTO daily VALUES ('2026-06-06', 72, 444, 'moderate')"
    )
    conn.commit()
    conn.close()

    # Point the app at the temp DB so no test can touch the real one.
    monkeypatch.setenv("FITNESS_DB", str(db_path))
    return db_path

The fixture lives in conftest.py so pytest shares it across files, but the test itself belongs in a normal test module; pytest doesn’t collect tests out of conftest.py. The test asks for the fitness_db fixture by name and reads the seeded row through the app’s own data layer.

# tests/test_snapshot.py
from fitness import db

def test_snapshot_reads_seeded_row(fitness_db):
    row = db.snapshot_for("2026-06-06")
    assert row.recovery == 72
    assert round(row.sleep_minutes / 60, 1) == 7.4

That data layer is the small module under test. The detail that makes the fixture work is that snapshot_for reads FITNESS_DB inside the call, not at import time, so the fixture’s monkeypatch.setenv is already in effect when the path is resolved.
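The difference between the two resolution times can be shown without pytest at all. This is a minimal sketch using only os.environ; the two paths and both function names are hypothetical, and the second assignment stands in for what monkeypatch.setenv does inside the fixture.

```python
import os

# Hypothetical paths, for illustration only.
os.environ["FITNESS_DB"] = "/real/fitness.db"

# Import-time style: the value is captured once, when this line runs.
DB_PATH_AT_IMPORT = os.environ["FITNESS_DB"]


def path_frozen_at_import() -> str:
    return DB_PATH_AT_IMPORT


def path_resolved_per_call() -> str:
    return os.environ["FITNESS_DB"]


# Later, a fixture repoints the variable, as monkeypatch.setenv does.
os.environ["FITNESS_DB"] = "/tmp/pytest-0/fitness.db"

print(path_frozen_at_import())   # /real/fitness.db
print(path_resolved_per_call())  # /tmp/pytest-0/fitness.db
```

A module written in the first style would keep pointing at the real database no matter what the fixture set, which is the same class of bug the clean runner exposed.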

# fitness/db.py
import os
import sqlite3
from typing import NamedTuple

class Snapshot(NamedTuple):
    day: str
    recovery: int
    sleep_minutes: int
    training_load: str

def snapshot_for(day: str) -> Snapshot:
    """Read one day's row from whatever database FITNESS_DB points at.

    Resolving the env var here (not at import) means tests can repoint it
    before the call without reloading the module.
    """
    db_path = os.environ["FITNESS_DB"]
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT day, recovery, sleep_minutes, training_load "
            "FROM daily WHERE day = ?",
            (day,),
        )
        row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        raise KeyError(day)
    return Snapshot(*row)

What actually surfaced the bug was the clean runner with no copy of my local database, not the coverage gate. Coverage measures which lines executed; it can’t tell that a test read the wrong file. What the coverage gate does is keep untested paths from hiding, so the code that breaks on a fresh checkout is forced to run somewhere I’ll see it.
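Since coverage can't see which file a test opened, one option is an explicit guard that checks the database path sits under the test's temp directory, so the wrong-file case fails on every machine and not only on a clean runner. This is a sketch under assumptions, not something the project is confirmed to do: is_under is a hypothetical helper, and the "real" path is made up.

```python
import tempfile
from pathlib import Path


def is_under(db_path: str, root: str) -> bool:
    """True when db_path resolves to a location inside root."""
    return Path(db_path).resolve().is_relative_to(Path(root).resolve())


with tempfile.TemporaryDirectory() as tmp:
    seeded = str(Path(tmp) / "fitness.db")  # what a fixture would create
    real = "/home/me/garmin/fitness.db"     # hypothetical real database path
    seeded_ok = is_under(seeded, tmp)
    real_ok = is_under(real, tmp)

print(seeded_ok, real_ok)  # True False
```

Inside a fixture, the same check could be an assertion on the FITNESS_DB value against tmp_path before any test body runs.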

Wire it: lint, the coverage gate, and the scorer on every push

The three checks run together so nothing ships without passing all of them. Linting and the hermetic test suite cover the deterministic code, the coverage gate fails the job if the suite exercises too little of the package, and the prompt scorer guards the one surface that doesn’t compile. Putting the scorer in the same pipeline as the tests is deliberate: the prompt is a build artifact, so it gets gated like one.

# .github/workflows/ci.yml
name: ci
on: [push, pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"

      - name: Lint
        run: ruff check .

      - name: Tests with coverage gate
        run: pytest --cov=fitness --cov-fail-under=85
        # The runner has internet by default, but the hermetic fixtures bring
        # their own seeded DBs and never make network calls, so the result
        # doesn't depend on the environment.

      - name: Score the prompt against the schema
        run: python -m fitness.score_prompt
        # Exits non-zero on drift, failing the job loud.

Each step fails the job on its own, so a drifted prompt and an under-covered module are both blocking, not advisory. Ruff, pytest, and pytest-cov are third-party packages, so they live in a dev extra that the pip install -e ".[dev]" step pulls in.

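# pyproject.toml (fragment; name, version, and other metadata omitted)
[project]
dependencies = ["pydantic"]  # imported by fitness/schema.py at runtime

[project.optional-dependencies]
dev = ["ruff", "pytest", "pytest-cov"]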

Verify it: catch the drift and watch the test fail

Two failures prove the wiring. First, introduce real drift by removing a member from the enum while the prompt still advertises it, and run the scorer.

$ python -m fitness.score_prompt
FAIL: prompt advertises non-schema metrics: ['training_load']
$ echo $?
1

The scorer found that the prompt still promises training_load after it was dropped from Metric, and exited non-zero, so CI stops here instead of letting the next briefing request crash on a ValidationError in production.

Second, make the seeded value and the assertion disagree, for example by seeding recovery as 70 while the test still expects 72, and confirm the fixture isolates the test so the failure is identical everywhere.

$ pytest tests/test_snapshot.py -q
F
E   assert 70 == 72
E    +  where 70 = Snapshot(day='2026-06-06', recovery=70, ...).recovery
1 failed in 0.04s

The test reads only the seeded row, so a wrong assertion fails the same way on my laptop and on a clean runner. That equivalence is the whole point of making it hermetic.

Version-driven, idempotent releases

The release workflow is built so that merging is not the same thing as shipping. Once CI is green on the default branch, it checks the version in the project file and cuts a GitHub Release only if that version is new. A normal merge is a no-op; bumping the version is the act that ships. This makes releasing idempotent, in that running the workflow again on the same version does nothing, so there’s no risk of duplicate or accidental releases from an ordinary merge. The version number is the release key, which keeps the rule for changing it honest: I follow semantic versioning, where the number communicates the nature of the change, and 0.y.z explicitly signals early development where the API isn’t yet stable (Semantic Versioning 2.0.0). v0.1.0 is exactly that signal, a first public cut that says “real software now, but still moving.”

I want to widen coverage into the briefing-generation and chat paths, which are the parts still leaning mostly on the prompt scorer rather than executed tests. And I want to treat prompt changes as releases in their own right, since a change to the briefing prompt changes the product as surely as a code change does, and it should go through the same version-and-ship gate.

Sources

Anthropic: Demystifying evals for AI agents — groundedness, coverage, and graders for turning “is this good” into concrete outcomes.
Pydantic: Standard Library Types — enum validation accepts only real members of the enum.
aipatternbook: Silent Failure — systems that keep running while returning plausible-but-wrong output.
Fail-fast system — detect violations early and stop, rather than continuing in a broken state.
Google Testing Blog: Hermetic Servers — tests that bring their own dependencies and don’t depend on the environment.
Semantic Versioning 2.0.0 — version numbers that communicate the nature of a change; 0.y.z for initial development.

Changelog

[local-fitness] Treat the agent as code: score the prompt, test the core, ship CI (#16) (6279ddf)
[local-fitness] docs: rewrite README as a shareable OSS project; add MIT LICENSE (#17) (8b35e2c)