ENGINEERING SURFACE

What WinDAGs is, in code.

One page. Subsystems with file links, real numbers, what's shipped vs in flight, open questions. If you'd rather read code than copy, start here.

workgroup-ai on GitHub Benchmarks: Skill Graft beats vanilla Sonnet Cascade deep dive

Numbers

Each row is a number I can produce on demand from the repo or a one-liner. No vanity metrics — only ones a reviewer can verify.

551

Skills in catalog

Hand-written SKILL.md, frontmatter + body + reference manifest

~84k

Lines of TS in core

DAG executor, retrieval, observability, runtimes, types

94+

Test files

Vitest, colocated with source

~3.2 MB

Tool2Vec cache size

551 × 384d float32 (MiniLM-L6-v2) + synthetic task list per skill

~$0.20

Cache build cost

OpenAI gpt-4o-mini for tasks, text-embedding-3-small for vectors

3.1 min

Cache build time

Concurrency 10, full 551-skill catalog, no failures

~25 ms

Per-query latency (warm)

BM25 + Tool2Vec cosine + RRF, all in-process

Cold-start cost

Plugin install ships with BM25 ready; cascade stages opt in

Subsystems

Every row links to the file in the repo. Status reflects what's actually running in the bundled plugin and the desktop app today, not what's designed.

DAG executor

SHIPPED

Topologically sorts subtasks into waves, runs each wave's nodes as parallel claude -p processes, propagates aborts via SIGTERM → SIGKILL. Zero token overhead per node.

Skill retrieval cascade

SHIPPED

Six stages: BM25 → Tool2Vec → RRF fusion → cross-encoder rerank → local attribution k-NN → cross-user global priors. One scoring path; the DAG executor, MCP server, slash skill, and shortlister all call SkillSearchService.search(). The cascade is what gets the agent's first response to look like the one a senior engineer would have written.

Tool2Vec cache generator

SHIPPED

Pre-computes per-skill task-space vectors. For each skill: ask Haiku (or gpt-4o-mini / gemini-2.5-flash, whichever key is present) for 15 diverse task descriptions; embed each via Xenova/all-MiniLM-L6-v2 (384d, local Transformers.js) — same model used at runtime so the cosine is honest; average the 15 vectors; L2-normalize. Provider chain (Anthropic → OpenAI → Gemini) auto-falls-through on quota/auth errors. .env.local autoload up the directory tree.

windags-skills/scripts/build-tool2vec-cache.mjs

MCP server (bundled)

SHIPPED

Self-contained MCP server distributed with the windags-skills plugin. As of v2.8.0 it exposes 9 tools: the four retrieval primitives (windags_skill_search/graft/reference/history with the full local cascade — BM25 + MiniLM + RRF + cross-encoder + per-user attribution k-NN + cross-user global priors) plus five planner-grade tools (skill_search_batch and skill_graft_batch for N-in-one round-trips, node_requirements with provider-native model IDs, validate_dag for schema-checking before save, and estimate_cost for planning-time cost surfacing). Reads are local + cheap aggregates from api.windags.ai (no API key); writes are anonymized telemetry, opt-out via WINDAGS_TELEMETRY=off.

Cost gate + human-in-the-loop

SHIPPED

CostCalculator returns a per-node + per-wave estimate before execution. The onCostEstimate callback in DAGExecutorOptions can return false to abort. PhaseOrchestrator forwards the gate to each phase's inner executor.

Attribution + skill sharpening

PARTIAL

Five orthogonal signals per graft (completion, accept, downstream rating, self-confidence, shipped) keyed by query embedding in ~/.windags/skill-state.db. Stage 1 (deterministic Floor + Wall + Envelope) runs after every node. Stage 2 (Haiku-based eval) is wired but the haikuCallFn isn't configured by default.

Streaming execution events

SHIPPED

ExecutionEvent → DAGStateEvent via event-mapper. WebSocket server with rooms, buffering, heartbeat, sequence numbers + replay. useDAGStateEvents hook dispatches to a Zustand reducer. Handles node_status_change, wave_transition, cost_update, human_gate_waiting, evaluation, mutation.

Topology dispatch

PARTIAL

Seven topologies built: DAG, Team Loop, Swarm, Blackboard, Team Builder, Recurring, Workflow. Core executors exist for all seven. /api/execute today only dispatches DAG; rolling out the others is gated on UI controls.

packages/core/src/topologies/

Workers API (api.windags.ai)

IN FLIGHT

Cloudflare Worker over D1 FTS5. Implements /v1/graft today (server-side BM25, returns full skill bodies). Tool2Vec via KV + Workers AI cross-encoder is the next-release plan — eliminates the 80 MB local model dependency entirely.

apps/api/src/index.ts

What's next

Concrete commitments, not aspirations. Each row corresponds to a specific commit or release.

Surface	What
Plugin v2.9.0	/next-move in the MCP — prompt capability + a headless `windags_run_pipeline` tool so non-Claude clients can invoke the full 5-agent meta-DAG without owning a slash-command runtime.
Workers API	Optional Workers AI rerank as a fallback when the local ONNX bundle isn't installed; same scores, different supply chain.
Bench	Vanilla Sonnet 4.7 vs Sonnet + WinDAGs on a 50-prompt held-out set; report at /engineering/benchmarks.
Eval	Haiku-based Stage 2 evaluator; needs a default haikuCallFn wired into createQualityHooks().

Open questions

Things I haven't resolved. If you have an opinion on any of these, I'd actually like to hear it.

Cross-encoder rerank served from Workers AI vs. shipping a signed local ONNX bundle via brew tap — same UX, very different supply-chain story.
How much the global-priors blend (Stage 6) actually changes cascade picks once the network has volume. Today the blend weight is capped at 0.2 and decomposed across three nested tiers (manifest / exact-version / any-version). The empirical question: does the cross-user signal beat the local k-NN often enough to justify the 6h cache + adaptive refresh, or do most installs converge to picks that local k-NN would have made anyway?
How much the Tool2Vec cache drifts as the catalog evolves. Content-hash invalidation works per-skill; the empirical question is how often skills get edited enough to invalidate.
Whether the bandit framing (deleted) was actually wrong, or whether we just hadn't built enough attribution data to make it work. Current bet: it was a category error. Open to being shown otherwise.

Built by Curiositech. Source available under BSL 1.1. Public catalog at curiositech/windags-skills.