Autoresearch, Applied to Skills

Karpathy released autoresearch — a ~630-line single-GPU LLM training setup where you write a prompt in markdown and an AI agent iterates on the training code autonomously. No human in the loop between experiments. 700 experiments in 2 days. Shopify’s CEO tried it overnight: 37 experiments, 19% performance gain. Karpathy called it “the final boss battle” for frontier labs.

The core idea is that the loop itself — run, score, mutate, keep — is what matters. The target doesn’t have to be ML training code.

The autoresearch skill applies this same loop to Claude Code skills. Feed it a SKILL.md, it runs the skill repeatedly, scores outputs against binary evals, mutates the prompt, and keeps what improves. You get back a better skill, a results log, and a full mutation history.

And it’s not just skills — CLAUDE.md, MCP server prompts, agent system prompts — anything that’s a prompt with a measurable output fits the same loop. Skills were just the obvious first target.