
What happens when two skill-improvers improve each other?
When Meta-Skills Collide
We built a cross-evaluation agent, pointed Anthropic's skill-creator and our skill-architect at each other, and recorded what happened. Real transcripts. Real diffs. Real analysis of what two different philosophies of skill-building value.
TL;DR: We ran two competing AI skill-evaluation tools against each other — Anthropic's open-source skill-creator and our skill-architect. Both violated their own rules. Neither caught what the other caught. Self-evaluation has a structural blind spot that cross-evaluation fills. The experiment converged to A- from both directions.
Contenders — factory vs. library
Evaluate — initial scorecards
Improve — mutual fixes
Cross-Spiral — the blind spots
Conclusion — 7 learnings
In November 2025, right after I launched someclaudeskills.com, I noticed a gap. Some skills I knew from the inside out — hard-won career knowledge I'd been carrying around for years. (Production systems frequently do better with logistic regression on simple obvious features than with carefully-built bespoke ML. If you're rigging 3D avatar blendshapes, plan your correctives before you're done with the base shapes, not after.) Others I had basically no idea how to build. I was just writing down what I thought I knew and hoping it would hold up when an agent actually tried to use it.
That gap bothered me. So I built Skill-Coach — a meta-skill whose whole job is to look at other skills and make them better. And then immediately: why not run it on itself?

Our first time meta-skilling in November 2025. Skill-Coach Improves Itself: 5 Iterations of Meta-Improvement
A skill (in the Claude Code skills framework) is a SKILL.md file — a structured document that injects domain expertise into an agent before it runs. A meta-skill is a skill whose job is to evaluate and improve other skills. We have two of them. One is Anthropic's, open-sourced under Apache 2.0. One is ours. We pointed them at each other and recorded what happened.
Skill Creator (SC) is Anthropic's meta-skill: 10 Python scripts, 3 evaluation agents, 1,325 lines of interactive HTML viewer. A factory floor. Skill Architect (SA) is WinDAGs' meta-skill: 13 reference documents, a scoring rubric, an anti-pattern catalog, 4 validation scripts. A library.
Here's the map — SA and SC are functions, SA(SC) means "skill-architect evaluates and improves skill-creator," ∘ is composition. Click any node to see the file tree and diffs for that version:
Composition Algebra: Click a Node
The Original Meta-Skill
Skill-Coach was the experiment that started all of this. Its job: look at any skill and make it better. Better triggering, cleaner structure, more honest about what it can and can't do. Eight reference files, a recursive self-improvement workflow, and then: run it on itself five times.
skill-coach/
├── SKILL.md                       ~400 lines
├── CHANGELOG.md
├── scripts/
│   ├── validate_skill.py
│   ├── check_self_contained.py
│   └── test_activation.py
└── references/
    ├── antipatterns.md            anti-pattern catalog with case studies
    ├── shibboleths.md             expert vs. novice vocabulary patterns
    ├── validation-checklist.md    complete review and testing guide
    ├── self-contained-tools.md    scripts, MCPs, and subagent patterns
    ├── scoring-rubric.md          quantitative 0-10 skill evaluation
    ├── skill-composition.md       cross-skill dependencies
    ├── skill-lifecycle.md         versioning and deprecation
    └── mcp_vs_scripts.md          when to use Skills vs. Agents vs. MCPs
How It Works
Skill-Coach applies a six-step creation process and a progressive disclosure philosophy:
- Phase 1 (~100 tokens): Metadata — "should I activate?"
- Phase 2 (<5k tokens): SKILL.md — "how do I do this?"
- Phase 3 (as needed): References — "show me the details"
The description formula: [What] [Use for] [Keywords] NOT for [Exclusions]. Its own description is the example: "Guides creation of high-quality Agent Skills... Activate on: create skill, review skill, skill quality... NOT for general coding advice, slash commands, MCP development."
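The formula lends itself to a quick lint pass. Here is a minimal sketch — a hypothetical helper, not a script either skill actually ships; the trigger-phrase patterns and the ~1024-character budget are our assumptions:

```python
import re

def lint_description(desc: str) -> list[str]:
    """Flag common misses against the [What] [Use for] [Keywords] NOT-for formula."""
    problems = []
    if not re.search(r"\bNOT\b", desc):
        problems.append("missing NOT clause (exclusions)")
    if not re.search(r"(?i)\b(activate on|use for|use when)\b", desc):
        problems.append("missing trigger phrases (Use for / Activate on)")
    if len(desc) > 1024:  # assumed frontmatter budget
        problems.append("description over ~1024 characters")
    return problems

good = ("Guides creation of high-quality Agent Skills. "
        "Activate on: create skill, review skill, skill quality. "
        "NOT for general coding advice.")
assert lint_description(good) == []
assert "missing NOT clause (exclusions)" in lint_description("Helps with skills.")
```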
The recursive self-improvement workflow uses its own scripts:
python scripts/validate_skill.py <path> # structural check
python scripts/check_self_contained.py <path> # phantom reference check
python scripts/test_activation.py <path> # activation rate check
Address ERRORS first, then WARNINGS, then SUGGESTIONS. Update CHANGELOG.md. Re-run until clean.
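The triage order is the load-bearing part of that loop. A minimal sketch of the ERRORS-before-WARNINGS-before-SUGGESTIONS ordering (the finding texts are invented for illustration):

```python
SEVERITY = {"ERROR": 0, "WARNING": 1, "SUGGESTION": 2}

def triage(findings: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort validator findings so errors are addressed first, then warnings,
    then suggestions -- mirroring the fix order described above."""
    return sorted(findings, key=lambda f: SEVERITY[f[0]])

findings = [
    ("SUGGESTION", "tighten description keywords"),
    ("ERROR", "phantom reference: references/missing.md"),
    ("WARNING", "SKILL.md approaching line budget"),
]
assert [sev for sev, _ in triage(findings)] == ["ERROR", "WARNING", "SUGGESTION"]
```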
What Skill-Coach^5 Found About Itself
Each generation found something the previous one was too close to see:
| Generation | What It Found | What Changed |
|---|---|---|
| SK₀ → SK₁ | Description triggered on "make my prompt better" | Narrowed to skill-specific vocabulary |
| SK₁ → SK₂ | Improvement workflow assumed full folder access without saying so | Added explicit folder-reading step |
| SK₂ → SK₃ | No NOT clause — fired on generic "quality review" queries | Added exclusion for non-skill content |
| SK₃ → SK₄ | scoring-rubric.md referenced criteria not defined in it | Added definitions, linked to examples |
| SK₄ → SK₅ | Shibboleths section wasn't itself written using shibboleths | Rewrote using domain vocabulary throughout |
By SK₅: tighter triggering, self-consistent examples, a workflow that matched its own structure. Cleaner, not longer.
Why It Matters Here
Skill-Coach established that a meta-skill can improve itself. SA and SC are meta-skills with different improvement philosophies. The question this experiment asks: what happens when you cross-apply them instead of self-applying them? Do they find the same things?
They don't.
Thank You, Anthropic
This experiment exists because Anthropic open-sourced their skill creation tooling under the Apache 2.0 license. They didn't have to. The license permits reproduction, derivative works, and public display with attribution. We include their complete skill-creator with full provenance — every file, every script, every agent definition. For the latest version: their repository. Everything here is a snapshot from commit b0cbd3df, March 7, 2026.
The Setup
Each skill got its own source folder, the target folder (read-only), and a writable output copy. Tools: Read, Write, Edit, Glob, Grep, Bash.
claude -p "You are Skill Architect. Your source folder: {sa_path}.
Evaluate and improve skill-creator at: {sc_path}.
Write all improvements to: {output_path}." \
--allowed-tools Read,Write,Edit,Glob,Grep,Bash(python:*) \
--permission-mode bypassPermissions
Each agent read its own skill folder — references, scripts, examples — before scoring the target. Layer 1: SKILL.md only. Layer 2: full folder access. Layer 3: after each skill self-improved, cross-evaluate again.
Three rounds. Six crossings. The braid is below.
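The crossing schedule itself is small enough to enumerate. A sketch, assuming only the two skill names and three rounds described above:

```python
from itertools import product

skills = ["skill-architect", "skill-creator"]
rounds = [1, 2, 3]

# Cross-evaluations only: evaluator != target gives two crossings per round.
crossings = [
    (r, evaluator, target)
    for r in rounds
    for evaluator, target in product(skills, skills)
    if evaluator != target
]
assert len(crossings) == 6  # three rounds x two directions
```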
Evaluation Matrix
All four evaluator × target combinations. Cross-evaluations found what self-evaluations missed. Score shown as before → after.
| Evaluator | Target | Round 1 | Round 2 | Round 3 |
|---|---|---|---|---|
| SA skill-architect | SC skill-creator | 5.1/10 | 6.0→8.0/10 | 8.2→8.8/10 |
| SC skill-creator | SA skill-architect | 6.7/10 | 7.6/10 | 8.8→8.97/10 |
| SA skill-architect | SA (self) | 7.3/10 | 7.3→8.8/10 | 8.8→8.9/10 |
| SC skill-creator | SC (self) | 7.0/10 | 7.0→8.2/10 | 8.2→8.6/10 |
Cross-Evaluation
- Added 2 Mermaid diagrams (creation flow + eval loop)
- Added NOT clause to description
- Added shibboleth section with templates
- Moved platform sections → references/ (~70 lines)
- Added description-optimization.md
- Added platform-notes.md
- Reduced SKILL.md 485→383 lines
- Added reference index
- Found 525-line violation (SC's own 500-line rule)
- Added missing eval-loop Mermaid diagram
- Added NOT clause to description
- Created CHANGELOG.md
- Rewrote description for trigger clarity
- Added Output Contracts section
- Added activation flowchart
- Identified 505-line violation
- Fixed validate_mermaid.py bug (identical error/warning icons)
- Removed phantom reference in antipatterns.md
- Converted 6 ASCII diagrams → Mermaid
- Added troubleshooting.md
- Found EVALUATION.md phantom self-contamination
- Extended validate_skill.py to scan all .md files
- Fixed HTML entities in 4 reference files
- Added <!-- phantom-ok --> annotation support
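The phantom-reference scan is worth seeing in miniature. This is not the actual check_self_contained.py or the extended validate_skill.py — the reference pattern is an assumption modeled on the fixes above, and the file-existence check is injected so the sketch stays self-contained:

```python
import re

def phantom_refs(md_text: str, file_exists) -> list[str]:
    """Collect referenced files that don't exist, skipping lines annotated
    <!-- phantom-ok -->. file_exists is an injected predicate (e.g. a set's
    __contains__ in tests, Path.exists in real use)."""
    phantoms = []
    for line in md_text.splitlines():
        if "<!-- phantom-ok -->" in line:
            continue  # intentionally-unresolved reference, per the annotation
        for ref in re.findall(r"(?:references|scripts)/[\w.-]+\.(?:md|py)", line):
            if not file_exists(ref):
                phantoms.append(ref)
    return phantoms

skill_md = (
    "See references/real.md\n"
    "See references/ghost.md\n"
    "See references/draft.md <!-- phantom-ok -->\n"
)
assert phantom_refs(skill_md, {"references/real.md"}.__contains__) == ["references/ghost.md"]
```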
Self-Evaluation
- Unrestricted Bash violates own least-privilege rule
- Anti-patterns section doesn't use own shibboleth template
- 23-type Mermaid table bloating SKILL.md (belongs in references)
- Progressive Disclosure scored 6/10 — content at wrong layer
- Scoped Bash to Bash(python:*) — fixed own least-privilege violation
- Self-Containment 6→9 (+3): fixed all 7 phantom references
- Visual Artifacts 5→9 (+4): added progressive disclosure diagram
- Reduced 504→467 lines, still missed EVALUATION.md phantom
- R2's 8.8 was inflated: re-assessed as 7.4 (broken Mermaid, invented keys, phantoms)
- Fixed 31+ HTML entities breaking Mermaid rendering in 5 reference files
- Removed invented frontmatter keys contradicting own Invalid Keys guidance
- Compressed SKILL.md 466→381 lines; added Self-Consistency as 7th dimension
- Description not 'pushy' — violates own optimization advice (ironic)
- run-1/ path missing breaks aggregate_benchmark.py (functional bug)
- No grader subagent prompt template despite documenting grader flow
- Self-Containment 4/10 — no resource inventory or graceful degradation paths
- Rewrote description to be imperative and 'pushy' (own medicine)
- Added resources inventory with graceful degradation paths
- Fixed run-1/ paths in all output templates
- Reordered grader steps 7↔8 (can't write timing before reading it)
- Fixed description voice: second-person → imperative (own medicine, again)
- Fixed aggregate_benchmark.py eval_id parsing — root cause, not just docs
- Removed "Cool? Cool." colloquialism breaking instructional tone
- Added eval_metadata.json schema to schemas.md (referenced but undefined)
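Both tools tripped over their own 500-line guideline (525, 505, and 504 lines at various points). A trivial guard, sketched here as a standalone function rather than either tool's actual validator, would have caught every instance:

```python
def check_line_budget(skill_md_text: str, budget: int = 500) -> tuple[int, bool]:
    """Return (line count, within-budget?) for a SKILL.md body.
    The 500-line budget is the guideline both meta-skills document."""
    n = len(skill_md_text.splitlines())
    return n, n <= budget

# A 525-line file (like the violation SA found in SC) fails the check.
n, ok = check_line_budget("x\n" * 525)
assert (n, ok) == (525, False)
```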
Score Evolution
Four evaluation paths. Hover any path to isolate it and see who is evaluating whom.
File Evolution Explorer
Pick a journey: who evaluated whom? Then browse every file across all rounds. Colored dots show which versions contain each file.