The Loop That Improves Almost Anything
Earlier this year Andrej Karpathy published a little thing called autoresearch, and it has been rattling around in my head ever since. It was a few hundred lines of Python that let an AI agent improve a machine learning model on its own, overnight, while he slept. The rules are tight: the agent edits one file, is judged on one metric, and gets one five-minute window per attempt. Keep the change if the model got better, throw it away if it didn’t, and go again. By morning it had run dozens of experiments and found real improvements he hadn’t told it to look for.
That’s the narrow version. The more I looked at it, the more I became convinced it was hiding a much bigger one. What Karpathy had really built was a pattern, and once you see the pattern you can point it at almost anything — code, prompts, a network of agents, even the words you’re reading right now. Something you’ve already made that’s pretty good, ground down round after round until it’s the best you can get.
If machine learning isn’t your world, don’t worry. That’s just where the idea started. By the end I’ll have the same loop sharpening an ordinary written report, and then this very post. Let me explain how it works, and then show you how I turned it into a skill you can install in Claude Code in about thirty seconds.
The loop
Here’s the original idea, stripped to its bones.
A language model runs a loop. On each turn of the loop, it produces a version of whatever it’s trying to optimize. In Karpathy’s case, that was the Python code that trains a machine learning model. Call that thing the artifact.
The artifact then gets tested, and the test produces a single number that says how good it is. In Karpathy’s setup that was the model’s error after a short burst of training: lower is better. Call that number the fitness.
The model sees the result. It knows what it tried and how well that attempt scored. If this artifact beat everything that came before, it gets stored as the new best. Then the model starts another turn, using everything it learned from the previous attempts to try to do better still. And so on, round after round.
So far this sounds like ordinary optimization. But there’s a twist that makes it special. The model isn’t just nudging a few numbers in a program that already exists. It’s free to write new code, to invent approaches nobody tried. It’s playing the part a human scientist usually plays: staring at the last result, having an idea, and chasing it.
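If it helps to see the bones as code, here is a minimal Python sketch of the loop just described. It is not Karpathy's code and not betterbest's. The names `improve`, `propose`, and `fitness` are made up for illustration, and the "model" is a toy stand-in that nudges a number at random. In the real thing, `propose` would be a language model reading the history and writing a new artifact.

```python
import random


def improve(seed, propose, fitness, rounds=20):
    """Run a keep-if-better loop. Lower fitness is better, as with training error."""
    best, best_score = seed, fitness(seed)
    history = [(seed, best_score)]          # everything tried so far, with its score
    for _ in range(rounds):
        candidate = propose(best, history)  # in the real loop: a language model call
        score = fitness(candidate)
        history.append((candidate, score))
        if score < best_score:              # beat everything before it: store as best
            best, best_score = candidate, score
    return best, best_score, history


# Toy stand-ins, assumptions for illustration only.
rng = random.Random(0)


def toy_fitness(x):
    # Pretend "error": distance from an ideal value of 3.0.
    return abs(x - 3.0)


def toy_propose(best, history):
    # Stand-in for the model's hunch: a random nudge of the current best.
    return best + rng.uniform(-1.0, 1.0)


best, score, history = improve(0.0, toy_propose, toy_fitness, rounds=50)
print(f"best artifact {best:.3f}, fitness {score:.3f}, attempts {len(history)}")
```

The toy proposer ignores `history`, which is exactly what the real loop does not do. Passing the full record of attempts and scores to the proposer is what lets a language model act on what it learned rather than guess blindly.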
The pattern hiding inside
That’s the part that grabbed me. Nothing in that loop is actually about machine learning.
Think about the ingredients. You need an artifact you can change. You need a way to measure how good it is. And you need a model that can read that measurement and play a hunch about what to try next. Machine learning code has all three, but so do a thousand other things.
You could improve a report you’ve written. Some code. A prompt for a language model. A whole network of agents working together. Anything you can look at, change, and score.
The one place Karpathy had it easy was the measurement. His training run spits out a single score, and that score settles the matter with no argument. Out in the wider world you rarely get that. But you don’t need it. Your measure of fitness can be anything you like — including the judgment of a subagent (a second AI you hand one small job) that reads the work and tells you what it thinks. And you’re not stuck with one number, either. You can use a whole rubric: a short list of things you care about, each scored on its own, the way a teacher grades an essay for argument, evidence, and style rather than one overall mark.
Say I’m writing a report. The report has to land near a target length, so I’ll write a little script that counts the words. But length isn’t the point — I want the writing to be good. So I add more measures, each one a subagent with a job. One judges how easy the report is to read. One checks that it’s factually sound. One looks at the narrative and the flow. One asks whether it actually suits its purpose. I might say all of those count equally, except the word count, which matters more than the rest.
Then I hand the loop my rough first draft and let it run. It calls the subagents, runs the counting script, reads the scores, decides what to change for the next round, and goes again. Each round it keeps the version that scored better and throws the rest away, so the draft only ever moves uphill.
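To make the report example concrete, here is a rough Python sketch of how that rubric might collapse into one score. Everything specific in it is an assumption for illustration: the function names, the 20% length tolerance, the choice to count length double, and the judge scores, which are hard-coded placeholders where subagents would normally report back.

```python
def word_count_score(text, target, tolerance=0.2):
    """Score 1.0 at the target length, falling linearly to 0.0 once the
    draft is off by `tolerance` (as a fraction of the target) or more."""
    words = len(text.split())
    deviation = abs(words - target) / target
    return max(0.0, 1.0 - deviation / tolerance)


def rubric_score(scores, weights):
    """Weighted average of per-dimension scores, each between 0 and 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Equal weights for the judges, with length counting double (an assumed ratio).
weights = {
    "length": 2.0,
    "readability": 1.0,
    "factuality": 1.0,
    "flow": 1.0,
    "suitability": 1.0,
}

draft = "word " * 950  # stand-in for a real draft of 950 words

scores = {
    "length": word_count_score(draft, target=1000),
    # Placeholder values; in practice each comes from a subagent's judgment.
    "readability": 0.7,
    "factuality": 0.9,
    "flow": 0.6,
    "suitability": 0.8,
}

print(f"overall: {rubric_score(scores, weights):.3f}")
```

Since this score is higher-is-better, the loop keeps a new draft only when its overall score beats the best so far.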
betterbest
This pattern felt too useful to leave as a loose idea, so I packaged it into a Claude Code skill called betterbest.
I built a rough first version with Claude. Then I pointed that version at itself and let it rewrite its own instructions, generation after generation. The skill improved the skill, and then improved at improving itself. That odd kind of recursion is pretty standard in the programming world, where it goes by the name bootstrapping.
And the post you’re reading is the same trick run one more time. I gave the loop my original scrappy notes as the seed — the ideas you’ve just read are all mine. The words are not. They’re the output of betterbest running its rounds, with one subagent checking my facts, another reading for clarity and voice, another watching the length, another grading whether the ideas land. The draft you’d have gotten from my raw notes was the “good.” Round by round, the critics pushed it to better, and then to this.
The draft I started with — the “good” one — is still saved in my version history a few rounds back, worse in every way the critics measured.
If that makes you want to point it at something of your own, here’s how.
Installing it
betterbest is a Claude Code plugin, and getting it takes three commands. In Claude Code, add the marketplace it lives in, install the plugin, then reload:
/plugin marketplace add drewmccormack/betterbest
/plugin install betterbest@betterbest
/reload-plugins
That’s it. After the reload, the skill is available. Give it something you’ve already got that’s decent but not finished: a draft you wrote, a prompt that’s almost right, a config you keep fiddling with. Tell it how you’d know “better” if you saw it. Then let it climb.
The thing only has to be good enough to start. betterbest takes it from there.
Appendix A: the seed
For the curious, here are the original point-form notes I handed the loop as the seed. The post above is what betterbest made of them. Spelling and layout are lightly tidied; the ideas and wording are otherwise mine, untouched.
I don’t have any prose here, but I have a plan.
I want to write a blog post about the betterbest skill, and the idea of using autoresearch applied to just about anything. It is really a pattern, rather than a fixed algorithm.
TLDR; Karpathy’s autoresearch is a pattern that can be applied to improve just about anything, from code to LLM prompts and even prose.
Introduce autoresearch from Karpathy. It was introduced to optimize ML architectures.
The idea is that an LLM runs a loop. Each iteration of the loop, it generates a version of what it is optimizing, eg, the python code that generates a ML network.
The “artifact”, eg the python code, is then tested to produce a measure of how good it is, ie, a fitness. In the ML example, this is the loss function.
The LLM can see this result, and knows what it tries. If this artifact is better than previous attempts, it is stored as the best.
The LLM then starts a new iteration, builds on knowledge from previous iterations to try to make an even better artifact. And so forth.
This is a basic optimization loop, with an important caveat. It isn’t just tweaking parameters in some existing program, it is generating potentially completely new algorithms. You can think of it as the LLM taking over the role that a scientist would typically play, using insights to try to envisage better approaches.
The interesting thing is that this approach can be applied to just about anything, not just to ML or even programming tasks.
You could decide to improve a report you have written, some code, prompts for an LLM, a network of agents — anything really.
Where Karpathy used a well defined numerical test — the loss func — as a measure of fitness, in a general sense, you could use whatever you like, including evaluations from subagents. And you are not restricted to a single value, but can opt for a rubric of several.
For example, if I were writing a report, I might have a script that counts words, because the report has to be around some target length. But I also want the writing to be good, so I might dictate there should be other measures: an agent should evaluate how easy it is to read; another how factual it is; another the narrative and flow; another the suitability for purpose. All of these should be given equal weight, except the count, which should be more important.
The idea would be to give it my first attempt at the report (“good” version), and optimize to get “better” and “best”. It would iterate the loop, calling the sub-agents, running the count script, deciding how to make changes for the next round, etc.
I thought this pattern was worthy of encapsulation in a skill, so I developed “betterbest”.
I developed the skill using a bootstrap. I first developed a simplified version of it with claude, and then optimized the skill itself in a number of iterations using the previous generation of the skill.
And it won’t come as any surprise to learn that what you are reading is also the result of rounds of betterbest. I will provide the original notes I used as the seed below. You can see the ideas are mine, but the writing is generated using the skill.
Explain how you can install betterbest in claude code using the plugin commands.
Appendix B: the rubric that judged this post
Each round, the draft was scored by a panel and kept only if it beat the previous best. The rubric itself changed as the run went on: dimensions were added, merged, or sharpened when they turned out to be measuring the wrong thing.
Two kinds of measure show up below. A hard gate is pass-or-fail: break it and the draft is thrown out no matter how well it does elsewhere. A soft dimension just adds to the overall score, so a weak showing on one can be outweighed by strength on the others. The final panel:
| # | Dimension | Type | What it judges |
|---|---|---|---|
| 1 | Length | hard gate | A word-count script. Stay in range; a runaway draft fails outright. |
| 2 | Factuality | hard gate | Every claim about autoresearch, Karpathy, betterbest, and the install commands checked against sources. No guessed links. |
| 3 | Voice & non-cringe | soft | Plain, warm, non-corporate — my voice, not a marketing team’s. (Merged mid-run from two overlapping judges.) |
| 4 | Concept clarity | soft | Are the core ideas (loop, artifact, fitness, rubric, the generalisation) clear and well stressed? |
| 5 | Narrative & flow | soft | Does it hook, flow, and land the recursion story? |
| 6 | Links | soft | The right references, well placed, none broken or gratuitous. |
| 7 | Non-coder accessibility | soft | Read through the eyes of a non-coder power user: do they follow it, or trip on jargon? (Added when a rubric audit caught the run optimising the wrong things.) |
| 8 | Seed fidelity | hard gate | Does the draft still carry the ideas in Appendix A, or has polishing sanded them away? |
| 9 | AI-trope tells | soft | A sharp hunter for em-dash overuse, “not X, it’s Y” cadence, and rule-of-three tics. |
| 10 | Publicity / pull | soft | Does it sell the idea and make you want to read on — without turning into a sales pitch? |
Dimensions 7 through 10 weren’t in my original plan. They were added partway through, when audits of the run (and a few notes from me) showed the panel was missing things or rewarding the wrong moves. That is the loop working on its own rules: even the rubric is something you can point it at.
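For readers who think in code, the difference between a hard gate and a soft dimension can be sketched in a few lines of Python. This is an illustration under assumptions, not how betterbest does it internally. The names `panel_score` and `keep` are hypothetical, the soft scores are simply summed with equal weight, and the numbers are invented.

```python
from typing import Dict, Optional


def panel_score(gates: Dict[str, bool], soft: Dict[str, float]) -> Optional[float]:
    """Return None if any hard gate fails; otherwise the sum of the soft scores."""
    if not all(gates.values()):
        return None  # thrown out, no matter how well it did elsewhere
    return sum(soft.values())


def keep(candidate: Optional[float], best: Optional[float]) -> bool:
    """Keep a draft only if it passed every gate and beat the previous best."""
    if candidate is None:
        return False
    return best is None or candidate > best


# Invented numbers: a draft with strong soft scores that breaks the length gate.
failed = panel_score(
    gates={"length": False, "factuality": True, "seed_fidelity": True},
    soft={"voice": 0.9, "clarity": 0.9, "flow": 0.9},
)

# A weaker draft that passes every gate.
passed = panel_score(
    gates={"length": True, "factuality": True, "seed_fidelity": True},
    soft={"voice": 0.5, "clarity": 0.75, "flow": 0.5},
)

print(keep(failed, best=1.0), keep(passed, best=1.0))
```

A weak soft score can be outweighed by the others in the sum, while a failed gate cannot be bought back by anything.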