Give the agent a tool, not a shell | Nate Swenson Dev Log

Shipped

I asked my fitness agent a simple question: show me my training plan through today. It got there, but first it opened a shell, poked at the database schema, and ran a couple of SQL queries that errored before it finally printed a table. The answer was fine. Everything in front of the answer was noise I never asked for.

v0.7.0 adds get_training_plan_progress, which returns the whole plan in one call: every prescribed workout with its verdict (done, missed, pending, or rest), plus days to race, adherence, and a projected finish. Now the agent makes one tool call and hands back a clean table, with no shell and no schema spelunking. I also added a short rule, in the prompt and in the project’s instructions, to prefer the structured tools and never drop to sqlite by hand. The part worth writing about is how to actually build a tool like this so the agent never wants the shell again, and which of those two changes does the real work.

When an agent improvises, a tool is missing

The shell-poking was not the agent misbehaving. It was the agent routing around a gap. It had no clean way to ask for the graded plan day by day, so it reached for the lowest-level access it did have, raw SQL against the database. An agent dropping to primitives is a signal worth reading literally: the high-level affordance it needed did not exist, so it improvised one badly.

That maps onto how Anthropic frames agent design. The guidance is to invest in well-designed, high-level tools that return what the caller actually wants, rather than making the model assemble primitives itself (Anthropic: Writing tools for AI agents), and to keep the surrounding system as simple as the task allows (Anthropic: Building effective agents). The data was already there; only the web view knew how to read it. Giving the agent the same reading as a single tool is what removed the improvisation. The rest of this post is that tool, built end to end.

Setup: model the plan and what graded means

Before there can be a tool, there has to be a clear data model and a precise definition of the answer. The plan is a list of prescribed workouts keyed by date. Separately, logged workouts record what actually happened. Grading is the join between them. Start with plain, frozen dataclasses and a Verdict type so the four verdicts (done, missed, pending, rest) are named once and reused everywhere.

# fitness/plan.py
from dataclasses import dataclass
from datetime import date
from typing import Literal

Verdict = Literal["done", "missed", "pending", "rest"]

@dataclass(frozen=True)
class Prescribed:
    day: date
    workout: str          # human label, e.g. "8 mi easy"
    target_miles: float

@dataclass(frozen=True)
class Logged:
    day: date
    miles: float          # what Garmin or a manual entry recorded

The data layer already knows how to read both lists out of SQLite. The point of this release is that the agent should never have to.

Grade each day, with one honest boundary

The grading rule is small, and almost all of its value is in one boundary. A rest day owes nothing, so it short-circuits first: a past or current rest day is rest, a future one is pending. For a real workout, the past with nothing logged is missed and the future is pending. The case that is easy to get wrong is today: a workout prescribed for today with nothing logged yet is still pending, not missed, because the day is not over. Grade today as missed and the agent will tell an athlete they failed a workout they still have hours to run.

# fitness/plan.py (continued)
def grade_day(p: Prescribed, logged: Logged | None, today: date) -> Verdict:
    if p.target_miles == 0:       # rest day: nothing is owed
        return "pending" if p.day > today else "rest"
    if logged is not None and logged.miles > 0:
        return "done"
    if p.day >= today:            # today is not over yet, future is not owed yet
        return "pending"
    return "missed"               # prescribed before today, nothing logged

Keeping today an explicit argument rather than calling date.today() inside the function is what makes the boundary testable later. The function is pure, so a test can hand it any reference date.
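To make that concrete, here is a small standalone sketch that grades one unlogged workout against three different reference dates. It repeats the dataclasses and grade_day from above so it runs on its own; the dates are illustrative, and Optional is used in place of the union syntax only so the sketch also runs on older Python versions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Literal, Optional

Verdict = Literal["done", "missed", "pending", "rest"]

@dataclass(frozen=True)
class Prescribed:
    day: date
    workout: str
    target_miles: float

@dataclass(frozen=True)
class Logged:
    day: date
    miles: float

def grade_day(p: Prescribed, logged: Optional[Logged], today: date) -> Verdict:
    # Same rule as fitness/plan.py, copied so this sketch is standalone.
    if p.target_miles == 0:
        return "pending" if p.day > today else "rest"
    if logged is not None and logged.miles > 0:
        return "done"
    if p.day >= today:
        return "pending"
    return "missed"

tempo = Prescribed(date(2026, 6, 22), "5 mi tempo", 5.0)

# One unlogged workout, graded as the calendar moves past it.
verdicts = {
    ref.isoformat(): grade_day(tempo, None, ref)
    for ref in (date(2026, 6, 21), date(2026, 6, 22), date(2026, 6, 23))
}
print(verdicts)
```

The workout stays pending the day before and on the day itself, and only flips to missed once the reference date has moved past it. No clock mocking is needed, because the function never looks at the clock.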

Assemble the plan and expose it as one tool

Now wrap the grading in an aggregation that also computes the summary fields the caller wants, then expose the whole thing as a single tool. Adherence is completed over owed workouts, counting only resolved days (graded done or missed); rest days and not-yet-due days do not count, so the denominator never includes a workout the athlete has not actually had a chance to finish. The projected finish is a deliberately simple model; the real one can drop in behind the same return shape, which is the point of returning structured data the caller does not have to assemble (MCP Python SDK).

# fitness/plan.py (continued)
def project_finish(goal_minutes: float, adherence: float | None) -> float:
    # Simple stand-in model: missed volume slows you proportionally.
    # Swap in your real projection without changing the return shape.
    if adherence is None:
        return goal_minutes
    penalty = (1 - adherence) * 0.08          # up to ~8% slower at zero adherence
    return round(goal_minutes * (1 + penalty), 1)

def training_plan_progress(plan: list[Prescribed], logs: list[Logged],
                           today: date, goal_minutes: float) -> dict:
    # Assumes plan is non-empty and sorted by date, so plan[-1] is race day.
    by_day = {entry.day: entry for entry in logs}
    days, completed, owed = [], 0, 0
    for p in plan:
        verdict = grade_day(p, by_day.get(p.day), today)
        if verdict in ("done", "missed"):   # only resolved days count
            owed += 1
            completed += int(verdict == "done")
        days.append({"date": p.day.isoformat(), "prescribed": p.workout,
                     "verdict": verdict})
    adherence = round(completed / owed, 2) if owed else None
    return {
        "days": days,
        "days_to_race": (plan[-1].day - today).days,
        "adherence": adherence,
        "projected_finish_minutes": project_finish(goal_minutes, adherence),
    }
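To get a feel for the stand-in projection, the sketch below runs it at a few adherence levels. It copies project_finish so it runs on its own, and the 200-minute goal is a made-up number for illustration, not a value from the real plan.

```python
from typing import Optional

def project_finish(goal_minutes: float, adherence: Optional[float]) -> float:
    # Copy of the stand-in model above so this sketch is standalone.
    if adherence is None:
        return goal_minutes
    penalty = (1 - adherence) * 0.08
    return round(goal_minutes * (1 + penalty), 1)

GOAL = 200.0  # hypothetical goal time in minutes, for illustration only

# Perfect, half, and zero adherence, plus the "nothing resolved yet" case.
projections = {a: project_finish(GOAL, a) for a in (1.0, 0.5, 0.0, None)}
for adherence, minutes in projections.items():
    print(adherence, minutes)
```

With that hypothetical goal, perfect adherence leaves the projection at 200.0, half adherence gives 208.0, zero adherence gives 216.0, and a plan with no resolved days yet returns the goal unchanged rather than dividing by zero upstream.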

The tool itself is a thin wrapper. It loads the two lists, takes today’s date once, and returns the structured result. I register it with the MCP Python SDK’s FastMCP (the mcp package): you create a server, decorate the function with @mcp.tool(), and FastMCP derives the tool name from the function name and the description from the docstring, then inspects the type hints for the schema. The agent calls it with no arguments and gets back everything it would otherwise have tried to reconstruct with SQL.

# fitness/tools.py
from datetime import date
from mcp.server.fastmcp import FastMCP
from fitness import db
from fitness.plan import training_plan_progress

mcp = FastMCP("fitness")

@mcp.tool()
def get_training_plan_progress() -> dict:
    """Return the full training plan with per-day grading and projections."""
    return training_plan_progress(
        plan=db.prescribed_workouts(),
        logs=db.logged_workouts(),
        today=date.today(),
        goal_minutes=db.race_goal_minutes(),
    )

if __name__ == "__main__":
    mcp.run()   # stdio transport by default; the agent connects to this server

Use it: one call, one clean answer

Asking the agent for the plan now triggers a single tools/call, and the tool returns the same structured shape every time. The agent formats that into a table instead of opening a shell.
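On the wire, that single call is one JSON-RPC request. A minimal sketch of its shape under the MCP tools/call convention follows; the id value is arbitrary and picked here for illustration, and in practice the agent's MCP client builds this message for you.

```python
import json

# Shape of an MCP tools/call request for a tool that takes no arguments.
# The id is an arbitrary request identifier chosen for this sketch.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_training_plan_progress",
        "arguments": {},
    },
}

print(json.dumps(request, indent=2))
```

The payload the tool hands back for that request looks like this: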

{
  "days": [
    {"date": "2026-06-19", "prescribed": "5 mi easy",       "verdict": "missed"},
    {"date": "2026-06-20", "prescribed": "8 mi easy",       "verdict": "done"},
    {"date": "2026-06-21", "prescribed": "rest",            "verdict": "rest"},
    {"date": "2026-06-22", "prescribed": "5 mi tempo",      "verdict": "pending"},
    {"date": "2026-06-23", "prescribed": "12 mi long run",  "verdict": "pending"},
    {"date": "2026-08-25", "prescribed": "race day",        "verdict": "pending"}
  ],
  "days_to_race": 64,
  "adherence": 0.5,
  "projected_finish_minutes": 214.7
}

Today’s tempo run reads pending, not missed, which is the boundary doing its job, and the rest day reads rest rather than being scored as a missed workout. Adherence is 0.5 because only two days have resolved (one done, one missed); today, the future runs, and the rest day stay out of the denominator. Everything before today is already graded, and the summary fields ride along so the agent never has to total anything itself.
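The two summary numbers can be rechecked by hand from the sample response. The sketch below copies the dates and verdicts from that response and takes today to be 2026-06-22, the day of the pending tempo run; it is a sanity check on the arithmetic, not part of the tool.

```python
from datetime import date

# Dates and verdicts copied from the sample response.
days = [
    ("2026-06-19", "missed"),
    ("2026-06-20", "done"),
    ("2026-06-21", "rest"),
    ("2026-06-22", "pending"),
    ("2026-06-23", "pending"),
    ("2026-08-25", "pending"),
]
today = date(2026, 6, 22)

# Only done and missed days enter the adherence denominator.
resolved = [verdict for _, verdict in days if verdict in ("done", "missed")]
adherence = round(resolved.count("done") / len(resolved), 2)

# Race day is the last entry in the plan.
days_to_race = (date.fromisoformat(days[-1][0]) - today).days

print(adherence, days_to_race)   # 0.5 64
```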

Verify the boundary

The grading rule is pure, so the test is direct, and the one case worth defending is the today boundary. Pin a reference date and assert each verdict, including that today with nothing logged stays pending and a past rest day reads rest rather than missed. Grading today as missed is the failure mode that would quietly tell an athlete they missed a workout they still had time to run.

# tests/test_plan.py
from datetime import date
from fitness.plan import Prescribed, Logged, grade_day

TODAY = date(2026, 6, 22)

def test_logged_is_done():
    p = Prescribed(TODAY, "5 mi tempo", 5.0)
    assert grade_day(p, Logged(TODAY, 5.1), TODAY) == "done"

def test_today_unlogged_is_pending_not_missed():
    p = Prescribed(TODAY, "5 mi tempo", 5.0)
    assert grade_day(p, None, TODAY) == "pending"   # the day is not over

def test_past_unlogged_is_missed():
    p = Prescribed(date(2026, 6, 21), "5 mi easy", 5.0)
    assert grade_day(p, None, TODAY) == "missed"

def test_past_rest_day_is_rest_not_missed():
    p = Prescribed(date(2026, 6, 21), "rest", 0.0)
    assert grade_day(p, None, TODAY) == "rest"   # resting is not failing

def test_future_is_pending():
    p = Prescribed(date(2026, 6, 23), "12 mi long run", 12.0)
    assert grade_day(p, None, TODAY) == "pending"

The second test fails the moment someone simplifies the boundary to p.day > today, which is exactly the off-by-one that would mislabel today’s work.

The rule is a backstop, not the fix

I also added the instruction to prefer structured tools and avoid hand-written SQL, and I want to be honest about what that rule does. It is not the fix. The tool is the fix. A rule asking the model to behave is only reliable when behaving is also the path of least resistance, which is really a context-engineering point: behavior is shaped by what is available and what the instructions say, together (Anthropic: Effective context engineering for AI agents). With the tool in place, the rule keeps the agent honest on the days it gets clever and wants to go spelunking anyway. Without the tool, the same rule is just a sternly worded request the model can ignore the moment SQL looks faster. When you catch an agent improvising with low-level access, the durable move is to build the capability it was missing and then point the instructions at it.

The brief’s cross-model A/B check turned out to be flaky in a way that predates this release. Making that harness reliable is the next cleanup, so prompt changes stay cheap to verify.

Sources

Anthropic: Writing tools for AI agents — invest in high-level tools that return what the caller wants.
Anthropic: Building effective agents — keep the system as simple as the task allows.
Anthropic: Effective context engineering for AI agents — behavior is shaped by tools and instructions together.
MCP Python SDK — define a tool that returns structured data the caller does not have to assemble.

Changelog

feat: get_training_plan_progress tool + prefer-structured-tools nudge (0.7.0) (09827e5)
docs: design for clean fitness Q&A (plan-progress tool + presentation contract) (fd92067)