What happens when two skill-improvers improve each other?

skills · meta · anthropic · evaluation · recursion · composition

When Meta-Skills Collide

We built a cross-evaluation agent, pointed Anthropic's skill-creator and our skill-architect at each other, and recorded what happened. Real transcripts. Real diffs. Real analysis of what two different philosophies of skill-building value.

TL;DR: We ran two competing AI skill-evaluation tools against each other — Anthropic's open-source skill-creator and our skill-architect. Both violated their own rules. Neither caught what the other caught. Self-evaluation has a structural blind spot that cross-evaluation fills. The experiment converged to A- from both directions.

What's inside
Inspo — Skill-Coach^5 origin story
Contenders — factory vs. library
Evaluate — initial scorecards
Improve — mutual fixes
Cross-Spiral — the blind spots
Conclusion — 7 learnings

In November 2025, right after I launched someclaudeskills.com, I noticed a gap. Some skills I knew from the inside out — hard-won career knowledge I'd been carrying around for years. (Production systems frequently do better with logistic regression on simple obvious features than with carefully-built bespoke ML. If you're rigging 3D avatar blendshapes, plan your correctives before you're done with the base shapes, not after.) Others I had basically no idea how to build. I was just writing down what I thought I knew and hoping it would hold up when an agent actually tried to use it.

That gap bothered me. So I built Skill-Coach — a meta-skill whose whole job is to look at other skills and make them better. And then immediately: why not run it on itself?

Skill-Coach Improves Itself: 5 Iterations of Meta-Improvement — our first time meta-skilling in November 2025



A skill (in the Claude Code skills framework) is a SKILL.md file — a structured document that injects domain expertise into an agent before it runs. A meta-skill is a skill whose job is to evaluate and improve other skills. We have two of them. One is Anthropic's, open-sourced under Apache 2.0. One is ours. We pointed them at each other and recorded what happened.

SKILL CREATOR (SC) — Philosophy: The Factory Floor
  • 10 Python scripts
  • 3 evaluation agents
  • 1,325 lines of HTML viewer

SKILL ARCHITECT (SA) — Philosophy: The Library
  • 13 reference documents
  • 4 validation scripts
  • Anti-pattern catalog + rubric

Skill Creator (SC) is Anthropic's meta-skill: 10 Python scripts, 3 evaluation agents, 1,325 lines of interactive HTML viewer. A factory floor. Skill Architect (SA) is WinDAGs' meta-skill: 13 reference documents, a scoring rubric, an anti-pattern catalog, 4 validation scripts. A library.

Here's the map — SA and SC are functions, SA(SC) means "skill-architect evaluates and improves skill-creator," ∘ is composition. Click any node to see the file tree and diffs for that version:

Composition Algebra

  • SA₀: 7.2/10 (baseline)
  • SC₀: 4.7/10 (baseline)
  • SA ∘ SA (SA self-eval): 8.9/10
  • SC ∘ SC (SC self-eval): 8.6/10
  • SC ∘ SA (SC evaluates SA): ~8/10 est.
  • SA ∘ SC (SA evaluates SC): ~7/10 est.
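The algebra above can be made concrete with a small Python sketch. A meta-skill is just a function from skills to (improved) skills, and ∘ is ordinary function composition. Everything below is our own illustration: the `Skill` dataclass, the flat score bumps, and the stand-in `SA`/`SC` functions are hypothetical, not the real agents.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Skill:
    name: str
    score: float  # rubric score, 0-10

# A meta-skill, viewed as a function: evaluate the target, return an improved version.
MetaSkill = Callable[[Skill], Skill]

def compose(f: MetaSkill, g: MetaSkill) -> MetaSkill:
    """(f ∘ g)(x) = f(g(x)): apply g's improvements first, then f's."""
    return lambda skill: f(g(skill))

def SA(skill: Skill) -> Skill:
    # Hypothetical stand-in: the real agent rewrites files; here we just bump the score.
    return Skill(f"SA({skill.name})", min(10.0, skill.score + 1.0))

def SC(skill: Skill) -> Skill:
    return Skill(f"SC({skill.name})", min(10.0, skill.score + 0.8))

sc0 = Skill("SC0", 4.7)
result = compose(SA, SA)(sc0)  # SA ∘ SA, applied to skill-creator's baseline
print(result.name, round(result.score, 1))  # → SA(SA(SC0)) 6.7
```

The point of the notation is that SA ∘ SC and SC ∘ SA are different functions, which is exactly what the scores in the matrix below bear out.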

The Original Meta-Skill

Skill-Coach was the experiment that started all of this. Its job: look at any skill and make it better. Better triggering, cleaner structure, more honest about what it can and can't do. Eight reference files, a recursive self-improvement workflow, and then: run it on itself five times.

skill-coach/
├── SKILL.md                           ~400 lines
├── CHANGELOG.md
├── scripts/
│   ├── validate_skill.py
│   ├── check_self_contained.py
│   └── test_activation.py
└── references/
    ├── antipatterns.md                anti-pattern catalog with case studies
    ├── shibboleths.md                 expert vs novice vocabulary patterns
    ├── validation-checklist.md        complete review and testing guide
    ├── self-contained-tools.md        scripts, MCPs, and subagent patterns
    ├── scoring-rubric.md              quantitative 0-10 skill evaluation
    ├── skill-composition.md           cross-skill dependencies
    ├── skill-lifecycle.md             versioning and deprecation
    └── mcp_vs_scripts.md              when to use Skills vs Agents vs MCPs

How It Works

Skill-Coach applies a six-step creation process and a progressive disclosure philosophy:

  • Phase 1 (~100 tokens): Metadata — "should I activate?"
  • Phase 2 (<5k tokens): SKILL.md — "how do I do this?"
  • Phase 3 (as needed): References — "show me the details"

The description formula: [What] [Use for] [Keywords] NOT for [Exclusions]. Its own description is the example: "Guides creation of high-quality Agent Skills... Activate on: create skill, review skill, skill quality... NOT for general coding advice, slash commands, MCP development."
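That formula is mechanical enough to lint. Here is a minimal sketch of such a check — `check_description`, the trigger-phrase regex, and the length cap are all our own illustration, not part of either skill's tooling:

```python
import re

def check_description(desc: str) -> list[str]:
    """Lint a skill description against [What] [Use for] [Keywords] NOT for [Exclusions]."""
    problems = []
    if "NOT" not in desc:
        problems.append("missing NOT clause (exclusions)")
    if not re.search(r"(activate on|use for|use when)", desc, re.IGNORECASE):
        problems.append("no explicit trigger keywords")
    if len(desc) > 1024:  # illustrative cap; keep the metadata layer cheap
        problems.append("too long for the ~100-token metadata layer")
    return problems

desc = ("Guides creation of high-quality Agent Skills. "
        "Activate on: create skill, review skill, skill quality. "
        "NOT for general coding advice, slash commands, MCP development.")
print(check_description(desc))  # → []
```

Skill-Coach's own description passes; a description like "Formats tables" would fail both the NOT-clause and trigger-keyword checks.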

The recursive self-improvement workflow uses its own scripts:

python scripts/validate_skill.py <path>        # structural check
python scripts/check_self_contained.py <path>  # phantom reference check
python scripts/test_activation.py <path>       # activation rate check

Address ERRORS first, then WARNINGS, then SUGGESTIONS. Update CHANGELOG.md. Re-run until clean.
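The re-run-until-clean loop could be driven by a thin wrapper like this. The script names come from the tree above; the `ERROR`/`WARNING`/`SUGGESTION` line-prefix convention and the `run_checks` helper are assumptions of ours, not the scripts' documented output format:

```python
import subprocess
import sys

CHECKS = ["validate_skill.py", "check_self_contained.py", "test_activation.py"]

def run_checks(skill_path: str) -> dict[str, int]:
    """Run each validator and tally lines tagged by severity (assumed output format)."""
    counts = {"ERROR": 0, "WARNING": 0, "SUGGESTION": 0}
    for script in CHECKS:
        proc = subprocess.run(
            [sys.executable, f"scripts/{script}", skill_path],
            capture_output=True, text=True,
        )
        for line in proc.stdout.splitlines():
            for severity in counts:
                if line.startswith(severity):
                    counts[severity] += 1
    return counts

# The outer loop, per the workflow above:
# while run_checks("my-skill/")["ERROR"]:
#     ...  # fix ERRORS, then WARNINGS, then SUGGESTIONS; update CHANGELOG.md; re-run
```

Keeping the loop keyed on ERROR count first matches the stated priority order: a skill can ship with open SUGGESTIONS, but not with open ERRORS.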

What Skill-Coach^5 Found About Itself

Each generation found something the previous one was too close to see:

| Generation | What It Found | What Changed |
|---|---|---|
| SK₀ → SK₁ | Description triggered on "make my prompt better" | Narrowed to skill-specific vocabulary |
| SK₁ → SK₂ | Improvement workflow assumed full folder access without saying so | Added explicit folder-reading step |
| SK₂ → SK₃ | No NOT clause — fired on generic "quality review" queries | Added exclusion for non-skill content |
| SK₃ → SK₄ | scoring-rubric.md referenced criteria not defined in it | Added definitions, linked to examples |
| SK₄ → SK₅ | Shibboleths section wasn't itself written using shibboleths | Rewrote using domain vocabulary throughout |

By SK₅: tighter triggering, self-consistent examples, a workflow that matched its own structure. Cleaner, not longer.

Why It Matters Here

Skill-Coach established that a meta-skill can improve itself. SA and SC are meta-skills with different improvement philosophies. The question this experiment asks: what happens when you cross-apply them instead of self-applying them? Do they find the same things?

They don't.

Thank You, Anthropic

This experiment exists because Anthropic open-sourced their skill creation tooling under the Apache 2.0 license. They didn't have to. The license permits reproduction, derivative works, and public display with attribution. We include their complete skill-creator with full provenance — every file, every script, every agent definition. For the latest version: their repository. Everything here is a snapshot from commit b0cbd3df, March 7, 2026.

The Setup

Each skill got its own source folder, the target folder (read-only), and a writable output copy. Tools: Read, Write, Edit, Glob, Grep, Bash.

claude -p "You are Skill Architect. Your source folder: {sa_path}.
           Evaluate and improve skill-creator at: {sc_path}.
           Write all improvements to: {output_path}." \
  --allowed-tools Read,Write,Edit,Glob,Grep,Bash(python:*) \
  --permission-mode bypassPermissions

Each agent read its own skill folder — references, scripts, examples — before scoring the target. Layer 1: SKILL.md only. Layer 2: full folder access. Layer 3: after each skill self-improved, cross-evaluate again.

Three rounds. Six crossings. The braid is below.
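The full matrix can be driven headlessly by templating the `claude -p` invocation shown above over every evaluator × target pair. This sketch builds the command lines; the folder layout under `skills/` and `runs/` is our own invention, while the CLI flags are the ones from the setup:

```python
SKILLS = {"SA": "skills/skill-architect", "SC": "skills/skill-creator"}

def build_command(evaluator: str, target: str, round_no: int) -> list[str]:
    """Argv for one crossing: the evaluator reads its own folder, improves a copy of the target."""
    out = f"runs/round-{round_no}/{evaluator}-improves-{target}"
    prompt = (
        f"You are {evaluator}. Your source folder: {SKILLS[evaluator]}. "
        f"Evaluate and improve the skill at: {SKILLS[target]}. "
        f"Write all improvements to: {out}."
    )
    return [
        "claude", "-p", prompt,
        "--allowed-tools", "Read,Write,Edit,Glob,Grep,Bash(python:*)",
        "--permission-mode", "bypassPermissions",
    ]

# Driving the braid (needs: import subprocess) — 3 rounds over all pairs, self-evals included:
# for round_no in (1, 2, 3):
#     for evaluator in SKILLS:
#         for target in SKILLS:
#             subprocess.run(build_command(evaluator, target, round_no), check=True)
```

The read-only target folder and writable output copy from the setup are what make this safe to run unattended: an agent can mangle its output copy without touching the source of truth.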

Evaluation Matrix

All four evaluator × target combinations. Cross-evaluations found what self-evaluations missed. Score shown as before → after.

Cross-Evaluation

SA evaluates SC — R1
5.1/10
  • Added 2 Mermaid diagrams (creation flow + eval loop)
  • Added NOT clause to description
  • Added shibboleth section with templates
  • Moved platform sections → references/ (~70 lines)

SA evaluates SC — R2
6.0 → 8/10
  • Added description-optimization.md
  • Added platform-notes.md
  • Reduced SKILL.md 485→383 lines
  • Added reference index

SA evaluates SC — R3
8.2 → 8.8/10
  • Found 525-line violation (SC's own 500-line rule)
  • Added missing eval-loop Mermaid diagram
  • Added NOT clause to description
  • Created CHANGELOG.md

SC evaluates SA — R1
6.7/10
  • Rewrote description for trigger clarity
  • Added Output Contracts section
  • Added activation flowchart
  • Identified 505-line violation

SC evaluates SA — R2
7.6/10
  • Fixed validate_mermaid.py bug (identical error/warning icons)
  • Removed phantom reference in antipatterns.md
  • Converted 6 ASCII diagrams → Mermaid
  • Added troubleshooting.md

SC evaluates SA — R3
8.8 → 8.97/10
  • Found EVALUATION.md phantom self-contamination
  • Extended validate_skill.py to scan all .md files
  • Fixed HTML entities in 4 reference files
  • Added <!-- phantom-ok --> annotation support

Self-Evaluation

SA self-eval — R1
7.3/10
  • Unrestricted Bash violates own least-privilege rule
  • Anti-patterns section doesn't use own shibboleth template
  • 23-type Mermaid table bloating SKILL.md (belongs in references)
  • Progressive Disclosure scored 6/10 — content at wrong layer

SA self-eval — R2
7.3 → 8.8/10
  • Scoped Bash to Bash(python:*) — fixed own least-privilege violation
  • Self-Containment 6→9 (+3): fixed all 7 phantom references
  • Visual Artifacts 5→9 (+4): added progressive disclosure diagram
  • Reduced 504→467 lines, still missed EVALUATION.md phantom

SA self-eval — R3
8.8 → 8.9/10
  • R2's 8.8 was inflated: re-assessed as 7.4 (broken Mermaid, invented keys, phantoms)
  • Fixed 31+ HTML entities breaking Mermaid rendering in 5 reference files
  • Removed invented frontmatter keys contradicting own Invalid Keys guidance
  • Compressed SKILL.md 466→381 lines; added Self-Consistency as 7th dimension

SC self-eval — R1
7/10
  • Description not 'pushy' — violates own optimization advice (ironic)
  • run-1/ path missing breaks aggregate_benchmark.py (functional bug)
  • No grader subagent prompt template despite documenting grader flow
  • Self-Containment 4/10 — no resource inventory, graceful degradation

SC self-eval — R2
7.0 → 8.2/10
  • Rewrote description imperative and 'pushy' (own medicine)
  • Added resources inventory with graceful degradation paths
  • Fixed run-1/ paths in all output templates
  • Reordered grader steps 7↔8 (can't write timing before reading it)

SC self-eval — R3
8.2 → 8.6/10
  • Fixed description voice: second-person → imperative (own medicine, again)
  • Fixed aggregate_benchmark.py eval_id parsing — root cause, not just docs
  • Removed "Cool? Cool." colloquialism breaking instructional tone
  • Added eval_metadata.json schema to schemas.md (referenced but undefined)

Score Evolution

Four evaluation paths, from Round 1 (SKILL.md only) through Round 2 (full folder) to Round 3 (cross-spiral):

  • Architect evaluates Creator: 5.1 (C) → 8 (B) → 8.8 (A-)
  • Creator evaluates Architect: 6.7 (C+) → 7.6 (B-) → 8.97 (A-)
  • Architect evaluates Architect: 7.3 (B-) → 8.8 (A-) → 8.9 (A-)
  • Creator evaluates Creator: 7 (B-) → 8.2 (B) → 8.6 (B+)

File Evolution Explorer

Pick a journey: who evaluated whom? Then browse every file across all rounds. Colored dots show which versions contain each file.

One journey as an example — SA evaluates SC (skill-architect improves skill-creator):

  • SC₀ (original): 5.1
  • SC₁ (SA-improved): 8.0
  • SC₂ (SA₁-improved): 8.8
