A skill is code: scoring the SKILL.md and testing the glue around it

Shipped

v0.0.1 is the release where the ghostwriter skill started being treated like code instead of a clever text file. A Claude skill is a folder with a SKILL.md plus the scripts it drives, and that is software whether or not it compiles (Anthropic: Equipping agents for the real world with Agent Skills). So it got the things software gets: a scorer that grades the SKILL.md against a checklist and passes 8/8; a pytest suite of 94 tests covering all five Python scripts at 100% line coverage; a shellcheck lint on the one bash script; all three wired into CI on every push and PR to main; and a real release cut with a version field, a changelog entry, an annotated tag, and a GitHub release.

The interesting part isn’t the green checkmarks. A prompt and a pile of integration glue are exactly the two kinds of code people skip testing, and each one needs a different tool to get honest coverage. Here is the end-to-end build: what to check, how to score it, how to test the glue without real I/O, and how to gate all of it in CI.

Setup: what a SKILL.md must contain

The body of a SKILL.md is a prompt. There’s no compiler to tell you it’s missing the guardrail that keeps it from publishing without approval, and no type checker to notice you dropped the line declaring its operating modes. The failure mode is silent: the skill still runs, it just runs without the rule you thought you’d written. Anthropic’s own guidance treats the SKILL.md as a structured artifact with required frontmatter and a defined shape (Anthropic: Equipping agents for the real world with Agent Skills), and anything with a required shape can be checked for that shape.

So the requirements are concrete things the file must literally contain. Anthropic requires name and description in the frontmatter; I also require version, since a skill that ships releases needs one. The body has to declare the skill’s operating modes, state the never-publish-without-approval guardrail, reference the no-automated-posting compliance rule, and name the voice inputs it learns the user’s style from. Each one is a string or a key you can search for, which means each one is a test.

Build the scorer

The scorer turns each requirement into a predicate over the file’s text, reports every check that fails, and exits non-zero if any did, which is what makes it a CI gate rather than a report. The real skill runs eight such checks; below are the five load-bearing ones, one for the frontmatter and four for the body. The harness drops into any skill repo cleanly, but swap the body predicates for your skill’s own required phrases, since the four body checks hardcode ghostwriter’s exact policy strings.

#!/usr/bin/env python3
"""Score a SKILL.md against structural requirements. Exit non-zero so CI can gate."""
import re
import sys
from pathlib import Path

REQUIRED_FRONTMATTER = ("name", "description", "version")

# Each check pairs a human-readable label with a predicate over the file text.
def _frontmatter(text: str) -> str:
    # Grab the leading ---...--- block so a key in the body can't satisfy a check.
    m = re.match(r"^---\n(.*?)\n---", text, re.S)
    return m.group(1) if m else ""

CHECKS = {
    "frontmatter declares name, description, version":
        # [^\S\n]* matches spaces/tabs but not newlines, so an empty value fails.
        lambda t: all(re.search(rf"^{k}:[^\S\n]*\S", _frontmatter(t), re.M)
                      for k in REQUIRED_FRONTMATTER),
    "declares its operating modes":
        lambda t: "operating modes" in t.lower(),
    "states the never-publish-without-approval guardrail":
        lambda t: "never publish without approval" in t.lower(),
    "references the no-automated-posting compliance rule":
        lambda t: "automated posting" in t.lower(),
    "names the voice inputs it learns from":
        lambda t: "voice" in t.lower() and "past posts" in t.lower(),
}

def score(text: str) -> tuple[int, list[str]]:
    failed = [label for label, ok in CHECKS.items() if not ok(text)]
    return len(CHECKS) - len(failed), failed

def main(path: str) -> int:
    passed, failed = score(Path(path).read_text(encoding="utf-8"))
    print(f"SKILL.md score: {passed}/{len(CHECKS)}")
    for label in failed:
        print(f"  FAIL: {label}")
    # Non-zero exit on any failure is what turns this into a CI gate, not a report.
    return 1 if failed else 0

if __name__ == "__main__":
    raise SystemExit(main(sys.argv[1] if len(sys.argv) > 1 else "SKILL.md"))

These are structural checks, not behavioral ones. The scorer proves the rule is written down, not that the skill obeys it at runtime. That’s a real limit, and the follow-up described at the end of this post is about closing it. But “the guardrail is present and CI will fail if someone deletes it” is a meaningfully stronger guarantee than a prompt nobody verifies.
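The two regex details in the frontmatter check are easy to get subtly wrong, so here is a standalone sketch of just that predicate run against three tiny inputs. The `has_key` helper and the sample documents are illustrative names, not part of the real scorer.

```python
import re

def frontmatter(text: str) -> str:
    """Return the leading ---...--- block, or "" if the file has none."""
    m = re.match(r"---\n(.*?)\n---", text, re.S)
    return m.group(1) if m else ""

def has_key(text: str, key: str) -> bool:
    """True only if `key:` appears in the frontmatter with a non-empty value."""
    # [^\S\n]* consumes spaces and tabs but stops at the newline, so the
    # trailing \S must find a value on the same line as the key. A plain \s*
    # would skip the newline and accept the first character of the next line.
    pattern = rf"^{re.escape(key)}:[^\S\n]*\S"
    return bool(re.search(pattern, frontmatter(text), re.M))

GOOD = "---\nname: ghostwriter\nversion: 0.0.1\n---\nBody text.\n"
EMPTY_VALUE = "---\nversion:\nname: ghostwriter\n---\nBody text.\n"
KEY_ONLY_IN_BODY = "---\nname: ghostwriter\n---\nversion: 0.0.1\n"

print(has_key(GOOD, "version"))              # True
print(has_key(EMPTY_VALUE, "version"))       # False
print(has_key(KEY_ONLY_IN_BODY, "version"))  # False
```

The second and third inputs are the cases a looser check would wave through: an empty value followed by another key, and a key that only appears after the closing `---`.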

Mock the boundary to cover the glue

The five scripts the skill drives are almost pure glue. They include a LinkedIn client that makes network calls, a Playwright-based image renderer that drives a headless browser, and an OAuth callback server that wants to bind a socket and open a browser tab. Glue like this is what teams point at when they say “you can’t really unit-test this,” and they’re half right: you can’t test it by letting it touch the real network, the real browser, or a real socket on every CI run.

The way through is test doubles at the architectural boundary. Martin Fowler’s taxonomy is the useful vocabulary: a stub answers queries with canned responses, a spy is a stub that also records how it was called, a mock is pre-programmed with the calls it expects and verifies them, and a fake is a working stand-in like an in-memory implementation (Martin Fowler: Test Double; Martin Fowler: Mocks Aren’t Stubs). The discipline is to put those doubles exactly at the edge where your code meets the outside world. Everything inside the boundary, which is the logic you actually wrote, runs for real.

Take the LinkedIn client as the representative network case; the browser and socket doubles follow the same pattern. The five scripts live in a ghostwriter/ package, so this file is ghostwriter/linkedin.py. Tests import it as from ghostwriter import linkedin, and that package is exactly what the --cov=ghostwriter flag in CI measures.

# ghostwriter/linkedin.py: the network glue under test
import httpx

# /v2/ugcPosts is LinkedIn's legacy share endpoint (the current one is /rest/posts
# with a LinkedIn-Version header); ghostwriter still targets the legacy API here.
API = "https://api.linkedin.com/v2/ugcPosts"

def publish_post(token: str, author_urn: str, text: str) -> str:
    """POST a share to LinkedIn and return the created post id."""
    resp = httpx.post(
        API,
        headers={
            "Authorization": f"Bearer {token}",
            # ugcPosts requires protocol 2.0.0 on every request.
            "X-Restli-Protocol-Version": "2.0.0",
        },
        json={
            "author": author_urn,
            "lifecycleState": "PUBLISHED",
            "specificContent": {
                "com.linkedin.ugc.ShareContent": {
                    "shareCommentary": {"text": text},
                    "shareMediaCategory": "NONE",
                }
            },
            "visibility": {"com.linkedin.ugc.MemberNetworkVisibility": "PUBLIC"},
        },
    )
    resp.raise_for_status()
    # The canonical place to read the new id is this response header.
    return resp.headers["x-restli-id"]

The test replaces httpx.post with a double that records the request and hands back a canned response, so the assertions verify both that the glue built the right call and that it read the new id from the reply the way callers depend on. A second test drives the error branch, behavior the success test never asserts.

# test_linkedin.py
import httpx
import pytest
from ghostwriter import linkedin

def test_publish_post_sends_expected_payload(monkeypatch):
    captured = {}

    # A stub+spy at the network boundary: return a canned response AND
    # record the call so the test can assert on it afterward.
    def fake_post(url, headers, json):
        share = json["specificContent"]["com.linkedin.ugc.ShareContent"]
        captured.update(url=url, auth=headers["Authorization"],
                        text=share["shareCommentary"]["text"])
        request = httpx.Request("POST", url)
        # The new id comes back in this response header.
        return httpx.Response(201, headers={"x-restli-id": "urn:li:share:123"},
                              request=request)

    monkeypatch.setattr(linkedin.httpx, "post", fake_post)
    post_id = linkedin.publish_post("tok", "urn:li:person:me", "hello world")

    # Behavior verification: assert on the call the glue made to the collaborator.
    assert captured["url"] == linkedin.API
    assert captured["auth"] == "Bearer tok"
    assert captured["text"] == "hello world"
    # State verification: it read the new id from the header the way callers rely on.
    assert post_id == "urn:li:share:123"

def test_publish_post_raises_on_api_error(monkeypatch):
    def fake_post(url, headers, json):
        return httpx.Response(401, request=httpx.Request("POST", url))
    monkeypatch.setattr(linkedin.httpx, "post", fake_post)
    with pytest.raises(httpx.HTTPStatusError):
        linkedin.publish_post("bad", "urn:li:person:me", "nope")

Those two tests cover both the success path and the error path without touching the real network. The same pattern faked the Playwright page so the renderer thought it had a browser, and stood the OAuth callback server up against a loopback double so it never opened a tab or bound a real port. That is how all five scripts reached 100% line coverage across 94 tests.
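For the browser case, a hand-rolled fake can stand in for the Playwright page. The sketch below assumes a hypothetical `render_card` function that receives the page as a parameter; the real renderer's name and signature are not shown in this post, so treat both as placeholders. The only Playwright methods it relies on are `page.set_content` and `page.screenshot`, which the fake mirrors.

```python
from html import escape
from pathlib import Path

def render_card(page, title: str, out_path: Path) -> Path:
    """Hypothetical renderer: build HTML, load it in the page, save a screenshot."""
    html = f"<html><body><h1>{escape(title)}</h1></body></html>"
    page.set_content(html)
    page.screenshot(path=str(out_path))
    return out_path

class FakePage:
    """Fake for the two Playwright Page methods that render_card calls."""

    def __init__(self):
        self.calls = []

    def set_content(self, html: str) -> None:
        self.calls.append(("set_content", html))

    def screenshot(self, path: str) -> bytes:
        self.calls.append(("screenshot", path))
        return b""  # Playwright returns image bytes; none are needed here.

def test_render_card_loads_html_then_screenshots(tmp_path):
    page = FakePage()
    out = render_card(page, "Hello", tmp_path / "card.png")

    # The glue's own logic ran for real: it built the HTML and chose the path.
    assert page.calls[0][0] == "set_content"
    assert "<h1>Hello</h1>" in page.calls[0][1]
    assert page.calls[1] == ("screenshot", str(tmp_path / "card.png"))
    assert out == tmp_path / "card.png"
```

Passing the page in as an argument is one way to create the seam; patching the function that launches the browser, the way the LinkedIn test patches httpx.post, would work as well.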

Wire it into CI

The scorer, the coverage gate, and the bash lint all run on every push and pull request to main. The coverage gate is a flag, not a habit: --cov-fail-under=100 fails the job if a single line goes unexercised, so the number can’t quietly erode. The bash script gets shellcheck instead of a coverage number, because Python’s coverage tool can’t measure it and pretending otherwise would be dishonest.

# .github/workflows/ci.yml
name: ci
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install deps
        run: pip install -r requirements.txt pytest pytest-cov

      # Bash gets a static analyzer, not a fake coverage number. Preinstalled on the runner.
      - name: Shellcheck the release script
        run: shellcheck scripts/release-radar.sh

      # 100% line coverage or the job fails. The gate is the flag, not discipline.
      - name: Tests with coverage gate
        run: pytest --cov=ghostwriter --cov-report=term-missing --cov-fail-under=100

      # Score the prompt itself; the scorer's non-zero exit blocks the merge.
      - name: Score SKILL.md
        run: python scripts/score_skill.py SKILL.md

ShellCheck is a static analyzer that catches the quoting bugs, unset variables, and pitfalls that make shell scripts fail in ways the shell reports cryptically (ShellCheck), and it ships on GitHub’s Ubuntu runners, so the lint step needs no install. Each step exits non-zero on failure, so any one of the three can fail the job on its own. For that failure to actually block a merge, the quality job has to be marked as a required status check in the branch protection settings for main.
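The same exit-code contract can be exercised before pushing. The helper below is a hypothetical local runner, not something the repo is known to ship: it runs the three commands from the workflow in order and stops at the first non-zero exit, which is also what a job's steps do by default.

```python
import subprocess
import sys

# The same three commands the workflow runs, in the same order.
GATES = [
    ["shellcheck", "scripts/release-radar.sh"],
    ["pytest", "--cov=ghostwriter", "--cov-report=term-missing",
     "--cov-fail-under=100"],
    [sys.executable, "scripts/score_skill.py", "SKILL.md"],
]

def run_gates(gates: list[list[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in gates:
        try:
            code = subprocess.run(cmd).returncode
        except FileNotFoundError:
            # Tool not installed locally; 127 is the shell's "command not found".
            code = 127
        if code != 0:
            print(f"gate failed ({code}): {' '.join(cmd)}")
            return code
    return 0
```

Calling `raise SystemExit(run_gates(GATES))` from a small script gives a local command whose exit code matches what CI would report.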

Verify: a missing guardrail fails the gate

The proof that the scorer is a gate and not decoration is watching it go red. Delete the guardrail line and the exit code flips to 1, which is exactly what the CI step keys on. The run below uses the trimmed 5-check version shown above; the real skill scores 8/8.

$ python scripts/score_skill.py SKILL.md
SKILL.md score: 5/5

# Drop the approval guardrail, and the gate turns red. The I flag (a GNU sed
# extension, as is -i with no suffix) makes the match case-insensitive, mirroring
# the scorer's own t.lower() comparison, so a guardrail written in sentence case
# is still deleted rather than silently kept:
$ sed -i '/never publish without approval/Id' SKILL.md
$ python scripts/score_skill.py SKILL.md
SKILL.md score: 4/5
  FAIL: states the never-publish-without-approval guardrail
$ echo $?
1

That exit code is the whole mechanism. If a future edit drops the rule that keeps the skill from posting without a human saying yes, the merge can’t land.

One honest caveat sits underneath all of this. Coverage is a negative metric, not a positive one. Fowler’s point is that coverage is good at finding code with no tests, but a high number says little about whether the tests assert anything useful, since 100% is reachable with assertion-free tests that exercise lines without checking results (Martin Fowler: Test Coverage). So 100% here means every Python line ran under test with real assertions around the logic. It does not mean the scripts are bug-free, and it says nothing at all about the bash script, which is why that one carries shellcheck’s name instead of a borrowed number.

The scorer’s honest limitation is that it’s structural. The follow-up is to extend it toward behavioral checks: small fixtures that feed the skill a voice profile and assert the generated drafts honor the voice rules, so CI verifies the skill does the right thing and not only that the rule is written down. The other thread is keeping the version-and-changelog discipline going, treating each future change as a real release under semantic versioning, where 0.x signals the surface is still allowed to move while the project finds its shape (Semantic Versioning 2.0.0).
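One possible shape for such a fixture is sketched below. Everything in it is an assumption made for illustration: the `VoiceProfile` fields, the `voice_violations` helper, and the rules themselves are hypothetical, and the draft is a canned string standing in for output the skill would generate.

```python
from dataclasses import dataclass

@dataclass
class VoiceProfile:
    """Hypothetical voice rules; the real profile format is not shown here."""
    banned_phrases: tuple[str, ...] = ()
    max_chars: int = 3000
    allow_hashtags: bool = True

def voice_violations(draft: str, profile: VoiceProfile) -> list[str]:
    """Return a human-readable list of the voice rules a draft breaks."""
    problems = []
    lowered = draft.lower()
    for phrase in profile.banned_phrases:
        if phrase.lower() in lowered:
            problems.append(f"uses banned phrase: {phrase!r}")
    if len(draft) > profile.max_chars:
        problems.append(f"too long: {len(draft)} > {profile.max_chars} chars")
    if not profile.allow_hashtags and "#" in draft:
        problems.append("contains a hashtag")
    return problems

def test_draft_honors_voice_profile():
    profile = VoiceProfile(banned_phrases=("game-changer",), max_chars=280,
                           allow_hashtags=False)
    # In a real fixture this string would be the skill's generated draft.
    draft = "Shipped v0.0.1 today. The scorer now gates every merge."
    assert voice_violations(draft, profile) == []
```

Returning a list of violations rather than a boolean keeps the CI failure message specific, the same way the scorer prints one FAIL line per broken requirement.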

Sources

Anthropic: Equipping agents for the real world with Agent Skills — a skill is a folder of a SKILL.md plus scripts, a structured artifact with required frontmatter and shape.
Martin Fowler: Test Double — the taxonomy of stubs, mocks, fakes, and spies for standing in for real collaborators.
Martin Fowler: Mocks Aren’t Stubs — the distinction between state verification and behavior verification at test boundaries.
Martin Fowler: Test Coverage — coverage finds untested code but is a poor measure of test quality; 100% is reachable with assertion-free tests.
ShellCheck — static analysis for shell scripts that catches quoting bugs, unset variables, and pitfalls before runtime.
Semantic Versioning 2.0.0 — major version zero is for initial development where the public surface may still change.

Changelog

[linkedin-ghostwriter] Treat the skill as code: score SKILL.md, 100% script coverage, v0.0.1 (#3) (ebe0b09)