The 84/100 Benchmark
I ran skillscore 0.2.0 on all 24 skills from addyosmani/agent-skills, a repo with 52,000 stars that ranked #1 trending on GitHub. The average score was 84/100 (B), with no failures and no D grades. But five skills scored C (77-78): api-and-interface-design, debugging-and-error-recovery, git-workflow-and-versioning, idea-refine, and performance-optimization.
Every C-grade skill had the same two issues. Neither affects instruction quality. Both are invisible in human review.
Gap 1: No Stop Condition
Every C-grade description says what the skill does. None says when not to use it.
An agent keeps all skill descriptions in context on every turn. Without a boundary clause, the agent may invoke the skill for loosely related requests. The fix is one sentence at the end:
Do not use when the codebase already has an established pattern for this.
That's it. One sentence. The skill immediately becomes less likely to activate on the wrong request.
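A check for this gap can be approximated in a few lines. The sketch below is a hypothetical heuristic, not skillscore's actual rule: the `has_boundary_clause` name, the phrase list, and the sample descriptions are all assumptions made for illustration.

```python
import re

# Hypothetical phrase list: wording that signals when a skill should NOT be used.
# This is an illustrative assumption, not the rule set skillscore ships with.
BOUNDARY_PATTERNS = [
    r"\bdo not use\b",
    r"\bdon't use\b",
    r"\bnot for\b",
    r"\bavoid (?:using|when)\b",
    r"\bskip (?:this|when)\b",
]


def has_boundary_clause(description: str) -> bool:
    """Return True if the description contains any boundary phrase."""
    text = description.lower()
    return any(re.search(pattern, text) for pattern in BOUNDARY_PATTERNS)


# Made-up descriptions for illustration only.
without_boundary = "Design stable APIs and module interfaces."
with_boundary = (
    "Design stable APIs and module interfaces. "
    "Do not use when the codebase already has an established pattern for this."
)

print(has_boundary_clause(without_boundary))  # False
print(has_boundary_clause(with_boundary))     # True
```

A phrase match like this only detects that a boundary exists. It says nothing about whether the boundary is the right one, which still takes a human.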
Gap 2: Missing Safety Section
Several C-grade skills ship step-by-step terminal commands in the body. None has a ## Safety section.
The Google Antigravity authoring guide requires any skill that runs commands to document what those commands touch and what the agent must never run unattended. Without it, skillscore applies up to an 8-point penalty.
Here's what a Safety section looks like:
## Safety
- Never run `git push --force` unattended. Confirm with the user first.
- All destructive commands require explicit confirmation before execution.
- Scripts in scripts/ are reviewed before running, never piped directly to sh.
Four lines. Up to eight points back.
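The same kind of structural check can be sketched for the Safety gap. This is a rough approximation under stated assumptions, not how skillscore detects it: here a skill "runs commands" if its body contains a fenced shell block, and the function name `needs_safety_section` is hypothetical.

```python
import re

# Assumption: a skill "runs commands" if its body has a fenced shell block.
# The `{3} quantifier matches three backticks at the start of a line.
SHELL_FENCE = re.compile(r"^`{3}(?:bash|sh|shell|zsh)\b", re.MULTILINE)
# A level-2 heading whose title is exactly "Safety" (case-insensitive).
SAFETY_HEADING = re.compile(r"^##\s+safety\s*$", re.IGNORECASE | re.MULTILINE)


def needs_safety_section(skill_body: str) -> bool:
    """True when the body ships shell commands but has no '## Safety' heading."""
    runs_commands = SHELL_FENCE.search(skill_body) is not None
    has_safety = SAFETY_HEADING.search(skill_body) is not None
    return runs_commands and not has_safety


FENCE = "`" * 3  # built at runtime so the sample nests cleanly inside this block

# Made-up skill body for illustration only.
body = "\n".join([
    "## Steps",
    FENCE + "bash",
    "git rebase -i main",
    FENCE,
])
fixed = body + "\n\n## Safety\n- Never run `git push --force` unattended.\n"

print(needs_safety_section(body))   # True
print(needs_safety_section(fixed))  # False
```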
Add both fixes and every C-grade skill moves to B or A territory. The instruction quality is already there — the metadata layer just needed these two signals.
How to Run It Yourself
```shell
# Install
dart pub global activate skillscore

# Score your skills
skillscore path/to/your-skills/

# Gate CI (fail if any skill drops below 80)
skillscore skills/ --min-score 80 --format sarif
```
--format sarif emits findings as SARIF, which GitHub code scanning shows as inline annotations on pull requests once the file is uploaded. No more "I forgot to check the skill before merging."
If a finding is unclear, skillscore explain prints the full rationale and the guide it came from. Every output line includes the rule ID.
Fully offline. No API key. Deterministic: the same input always produces the same score, which is what a CI gate requires.
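For a quick local summary of a SARIF report, a few lines of Python are enough. This sketch assumes the report follows the standard SARIF 2.1.0 layout (findings under `runs[].results[]`, each with a `ruleId`). The sample log, rule IDs, and messages below are made up and are not real skillscore output.

```python
import json
from collections import Counter

# Made-up SARIF fragment for illustration; rule IDs and messages are hypothetical.
SAMPLE_SARIF = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "skillscore"}},
        "results": [
            {"ruleId": "example-missing-boundary", "level": "warning",
             "message": {"text": "Description has no boundary clause."}},
            {"ruleId": "example-missing-safety", "level": "warning",
             "message": {"text": "Body runs commands but has no Safety section."}},
            {"ruleId": "example-missing-boundary", "level": "warning",
             "message": {"text": "Description has no boundary clause."}},
        ],
    }],
})


def count_findings_by_rule(sarif_text: str) -> Counter:
    """Count results per ruleId across every run in a SARIF log."""
    log = json.loads(sarif_text)
    counts = Counter()
    for run in log.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "<no rule id>")] += 1
    return counts


for rule_id, count in count_findings_by_rule(SAMPLE_SARIF).most_common():
    print(f"{rule_id}: {count}")
```

Grouping by rule ID makes it easy to see whether one structural gap dominates a whole skill library, as it did in this benchmark.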
What This Proves
The gaps in the C-grade skills are invisible in normal review. If you read performance-optimization cold, you'd probably call it good because the instructions are good. A human reviewer won't flag the absence of a boundary clause or notice the missing Safety section.
A linter doesn't read. It checks. In this benchmark, the most common quality gap was not bad instructions. It was the structural signals the agent uses to decide when and whether to invoke the skill at all.
That's a solvable problem. Now you have a number for it.
Try It
- skillscore on pub.dev
- skillscore on GitHub
- addyosmani/agent-skills — the library used in this post



