Practical Guide to Evaluating and Testing Agent Skills
- Source: https://www.philschmid.de/testing-skills
- Author: Phil Schmid
- Clipped: 2026-03-07 (SGT)
TL;DR
Most agent skills are shipped with little or no testing. The article proposes a practical evaluation workflow: define measurable success, build a lightweight harness, run deterministic checks, and add model-based grading only where necessary.
Key ideas
- Skills are mostly instructions + scripts, so they should be tested like code.
- Define success before writing evals across three dimensions:
  - Outcome (does it work?)
  - Instruction/style adherence (does it follow required patterns?)
  - Efficiency (tokens/time/retries)
- Start with a small test suite (10–20 prompts), including negative tests where the skill should not trigger (a sample test-case set is sketched after this list).
- Execute tests through the same interface users/agents use (e.g., CLI runs), then capture outputs.
- Use deterministic checks (regex/structure rules) as the default for reliability and speed (a check-registry sketch follows this list).
- Use LLM-as-judge only for qualitative criteria (design quality, structure, tone), with a strict structured output schema (a judge-schema sketch follows this list).
- Iterate from failures: each real failure should become a new test case.
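A minimal sketch of what such a test-case set could look like, in Python. The field names (prompt, expect_trigger, checks, max_retries) are illustrative assumptions, not taken from the post; they map onto the outcome, adherence, and efficiency dimensions and include one negative test.

```python
# Hypothetical test-case format; names and values are placeholders.
TEST_CASES = [
    {
        "id": "basic-usage",
        "prompt": "Create a slide deck summarizing Q3 results",
        "expect_trigger": True,          # outcome: the skill should activate and succeed
        "checks": ["has_title_slide", "uses_template"],  # instruction/style adherence
        "max_retries": 1,                # efficiency budget
    },
    {
        "id": "negative-unrelated",
        "prompt": "What's the weather in Berlin today?",
        "expect_trigger": False,         # negative test: the skill should NOT trigger
        "checks": [],
    },
]
```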
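A sketch of a deterministic check registry under the same assumptions; the check names and rules are hypothetical examples of regex/structure checks.

```python
import re

# Each check takes the raw output string and returns a bool.
CHECKS = {
    # Regex check: the output should contain a markdown H1 title slide.
    "has_title_slide": lambda out: re.search(r"^# .+", out, re.MULTILINE) is not None,
    # Substring check: the output should reference the required template file.
    "uses_template": lambda out: "template.pptx" in out,
}

def run_checks(output: str, check_names: list[str]) -> dict[str, bool]:
    """Apply the named checks to one output and return per-check results."""
    return {name: CHECKS[name](output) for name in check_names}
```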
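For the LLM-as-judge path, a strict output schema could be expressed with a Pydantic model; the class, field names, and prompt below are assumptions for illustration, not the post's implementation.

```python
from pydantic import BaseModel, Field

class JudgeVerdict(BaseModel):
    passed: bool = Field(description="Whether the output meets the qualitative criterion")
    criterion: str = Field(description="Which criterion was judged, e.g. 'tone' or 'structure'")
    reasoning: str = Field(description="One or two sentences justifying the verdict")

# Hypothetical judge prompt; the judge model is asked to emit only JSON
# matching the schema above, so its verdict can be parsed deterministically.
JUDGE_PROMPT = (
    "You are grading an agent's output against a single qualitative criterion.\n"
    "Criterion: {criterion}\n"
    "Output to grade:\n{output}\n"
    "Respond only with JSON matching the JudgeVerdict schema."
)
```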
Practical workflow from the post
- Manually trigger the skill to expose assumptions.
- Build a prompt set with expected checks.
- Run the harness and collect outputs/stats.
- Apply the check registry per test.
- Improve the skill's instructions and description.
- Re-run multiple trials to account for non-determinism (a minimal harness loop is sketched after this list).
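A minimal harness loop, assuming the hypothetical CLI invocation, TEST_CASES, and run_checks names from the sketches above; it runs every prompt through the same CLI the agent uses and averages the pass rate over several trials to account for non-determinism.

```python
import subprocess
from statistics import mean

N_TRIALS = 3  # repeat the whole suite to smooth out non-determinism

def run_agent_cli(prompt: str) -> str:
    """Invoke the agent through the same interface users would; the command is a placeholder."""
    result = subprocess.run(
        ["agent", "run", "--prompt", prompt],  # hypothetical CLI invocation
        capture_output=True,
        text=True,
        timeout=300,
    )
    return result.stdout

def evaluate(test_cases, checks_fn) -> float:
    """Run every test case N_TRIALS times and return the mean pass rate."""
    # (Handling of expect_trigger / negative tests is omitted for brevity.)
    trial_pass_rates = []
    for _ in range(N_TRIALS):
        passed = 0
        for case in test_cases:
            output = run_agent_cli(case["prompt"])
            results = checks_fn(output, case["checks"])
            if all(results.values()):  # a case passes only if every check passes
                passed += 1
        trial_pass_rates.append(passed / len(test_cases))
    return mean(trial_pass_rates)

# Usage (hypothetical): evaluate(TEST_CASES, run_checks)
```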
Notable takeaway
In the author’s case study, clarifying the skill description and replacing soft guidance with explicit directives raised the pass rate from 66.7% to 100%.