Last week Karpathy open-sourced a repo called autoresearch. The premise is simple: you give an AI agent a training script and a performance metric, and you let it run overnight. It edits the code, trains for five minutes, checks if the result improved, keeps or discards, and repeats. The entire codebase is about 630 lines across three files. Karpathy pointed it at his nanochat setup, and over two days the agent ran around 700 experiments and found roughly 20 real improvements that, stacked together, cut training time by about 11%. Tobi Lütke at Shopify did something similar and got a 0.8B-parameter model to outperform his previous 1.6B.
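The keep-or-discard loop described above can be sketched in a few lines. This is my own paraphrase, not code from the autoresearch repo: `propose_edit` and `train_and_eval` are hypothetical stand-ins for the agent's code mutation and the short training run.

```python
import copy

def hill_climb(code, train_and_eval, propose_edit, budget=700):
    """Greedy keep-or-discard loop in the spirit of autoresearch.

    `train_and_eval` runs a short training job and returns a metric
    (lower is better); `propose_edit` asks an agent for a code change.
    Both are placeholders for the real agent and training machinery.
    """
    best_score = train_and_eval(code)
    for _ in range(budget):
        candidate = propose_edit(copy.deepcopy(code))  # agent edits the script
        score = train_and_eval(candidate)              # ~5-minute training run
        if score < best_score:                         # keep only improvements
            code, best_score = candidate, score
    return code, best_score
```

The point is how little structure the loop needs: one artifact to edit, one number to compare, and a greedy accept rule.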
I've been thinking about this idea from a different angle. Autoresearch is about optimizing a training loop. The agent edits one file, trains, evaluates, and loops. It's clean and it works because the problem is tightly scoped: one metric, one file, five-minute iterations. But the thing I'm interested in is whether an agent can do something harder than tuning hyperparameters. Specifically, whether it can recognize problems in its own environment and tooling and then reliably fix them.
That's the motivation behind a project I'm calling Agentic Experiments.
What it is
The setup is a repository organized as a sequence of numbered experiments. Each experiment gets its own directory with its own documentation, implementation, evaluation assumptions, and benchmark results. The first experiment, exp-01, is purely foundational. It defines the project structure, the proposal, the spec, and the research framing. No code has been written yet. That's intentional.
The system I'm building is a CLI-first coding agent with a minimal frontend and a backend that handles planning, tool orchestration, execution control, memory, and evaluation. The backend is where all the actual research value lives. The frontend is basically just enough to invoke the thing and look at what it did. I'm writing it in Python because the early work is going to be orchestration-heavy (subprocess management, benchmark scripting, configuration schemas, result logging) and Python is the fastest path to iterating on all of that without fighting the language.
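To make the orchestration-heavy claim concrete, here is a minimal sketch of the kind of plumbing the backend needs early on: run one benchmark task in a subprocess with a timeout and write a structured result record. The command, timeout, and log layout are illustrative defaults I made up, not the project's actual harness.

```python
import json
import subprocess
import time
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class RunResult:
    task_id: str
    passed: bool
    returncode: int
    duration_s: float
    stderr_tail: str

def run_task(task_id: str, cmd: list[str], log_dir: Path,
             timeout: int = 300) -> RunResult:
    """Run one benchmark task in a subprocess and log a JSON record."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
        result = RunResult(task_id, proc.returncode == 0, proc.returncode,
                           time.monotonic() - start, proc.stderr[-500:])
    except subprocess.TimeoutExpired as exc:
        stderr = exc.stderr or ""
        if isinstance(stderr, bytes):  # TimeoutExpired may carry raw bytes
            stderr = stderr.decode(errors="replace")
        result = RunResult(task_id, False, -1,
                           time.monotonic() - start, stderr[-500:])
    log_dir.mkdir(parents=True, exist_ok=True)
    (log_dir / f"{task_id}.json").write_text(json.dumps(asdict(result), indent=2))
    return result
```

Nothing here is clever, which is the point: Python's stdlib already covers subprocess control, timeouts, and result logging, so iteration speed stays high.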
The evaluation target is a subset of about 50 tasks from TerminalBench and SWE-Bench. If results indicate real improvement at that scale, I expand by another 25 tasks. The expansion is conditional because I don't want to mistake overfitting on a narrow set for actual capability gains.
How it's different from autoresearch
Karpathy's system optimizes within a fixed, well-defined loop. The agent is editing training code to minimize validation loss. The search space is bounded, the metric is clean, and the iteration cycle is five minutes. That's a very productive setup, but it's also a very constrained one. The agent doesn't need to understand its own tooling. It doesn't need to diagnose why a subprocess failed or figure out that its environment is misconfigured or decide that it needs a different approach to file modification. It just needs to make the number go down.
What I'm trying to do is broader and less clean. I want to see if a coding agent can look at a full software engineering task (the kind that shows up on SWE-Bench, where you're given a real repo and a real issue and you have to produce a real patch), and not just attempt the task, but also notice when something about its own execution pipeline is broken and fix that too. The self-improvement isn't about finding better hyperparameters. It's about an agent recognizing that its tool invocation is flawed, or its planning strategy doesn't work for a certain class of problem, or its memory policy is causing it to lose context at the wrong moment.
The experiment lineage system is the mechanism for tracking this. Each experiment variant gets evaluated. The strongest variant becomes the parent of the next generation. If exp-23 turns out to be the best performer, the next round of experiments would be exp-23-a, exp-23-b, exp-23-c, each varying one thing (prompting strategy, memory policy, tool policy, planning heuristics), and only the strongest descendant continues. It's a manually supervised evolutionary process. Later, if an agent variant is strong enough, it gets to participate in building its own successors. That's the self-building phase, and I don't expect to reach it quickly.
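The promotion rule is simple enough to write down. The naming scheme (`exp-23` to `exp-23-a`, `exp-23-b`, ...) follows the post; the scoring interface is a sketch, assuming each variant has already been reduced to a single aggregate number.

```python
import string

def strongest(scores: dict[str, float]) -> str:
    """Return the experiment id with the best aggregate score
    (higher is better in this sketch)."""
    return max(scores, key=scores.get)

def spawn_variants(parent: str, n: int = 3) -> list[str]:
    """Name the next generation under a winning parent,
    e.g. 'exp-23' -> ['exp-23-a', 'exp-23-b', 'exp-23-c']."""
    return [f"{parent}-{string.ascii_lowercase[i]}" for i in range(n)]
```

The hard part is hidden inside `scores`: collapsing 50 noisy task results into one comparable number per variant, which is a design problem in its own right.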
What I think will happen
I don't know what will happen. That's the honest version. But I have some guesses.
I think the first several experiments will be underwhelming. Getting the scaffold right, getting the benchmark harness working, getting the evaluation pipeline to produce repeatable results, all of that is going to eat most of the early effort. exp-01 is literally just documentation. The first experiment that actually runs code will probably fail at a bunch of tasks for boring reasons: bad subprocess handling, context window overflow, the agent misunderstanding the repo structure. That's fine. Those boring failures are the data.
I think the interesting part starts when I have enough failed runs to look at the patterns. If the agent consistently fails at tasks that require modifying multiple files, that tells me something about the planning module. If it consistently fails at tasks that require understanding test output, that tells me something about the observation pipeline. The benchmark isn't just measuring how good the agent is. It's diagnosing where the agent is broken. And the experiment lineage is the mechanism for acting on that diagnosis.
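The diagnosis step above is mostly bookkeeping. A sketch of what I mean, assuming each run record carries a manually assigned triage label (the categories are hypothetical labels I'd attach by hand, not fields the benchmarks provide):

```python
from collections import Counter

def failure_profile(runs: list[dict]) -> Counter:
    """Tally failed runs by a coarse triage category, e.g.
    'multi-file-edit' or 'test-output-parsing', so that the dominant
    failure mode points at the module most in need of work."""
    return Counter(r["category"] for r in runs if not r["passed"])
```

If `multi-file-edit` dominates the tally, the next experiment variant targets the planning module; if `test-output-parsing` does, it targets the observation pipeline.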
I think the self-building phase, where a strong agent variant helps construct the next variant, is further out than I'd like it to be. Karpathy's system gets to self-improvement fast because the loop is tight and the problem is narrow. My problem is wider, and the feedback cycle is slower, and there are more things that can go wrong in ways that are hard to attribute to a single cause. An agent that fails a SWE-Bench task might have failed because of a bad prompt, or bad tool use, or bad memory, or because the task was just hard. Disentangling those factors is going to be the main methodological challenge.
I also think there's a real chance that the experiment branching gets unmanageable. The proposal says "the branch continues only through the strongest measured descendant," which sounds clean on paper, but in practice deciding which descendant is strongest when you're comparing across 50 tasks with noisy results and different failure modes is not going to be straightforward. I'll probably need some kind of aggregate scoring system, and designing that system is itself an experiment in what you choose to optimize for.
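One candidate for such an aggregate, sketched here purely as a starting point: repeat each task a few times, take the per-task pass rate, then average over tasks so that no single flaky task dominates. The weighting is a placeholder; as the paragraph above says, choosing it is itself an experiment.

```python
from statistics import mean

def aggregate_score(results: dict[str, list[bool]]) -> float:
    """Per-task pass rate averaged over repeated runs, then averaged
    over tasks. Equal task weighting is an assumption, not a conclusion:
    it treats a task solved 1/3 of the time as partial credit rather
    than as noise."""
    per_task = [mean(1.0 if passed else 0.0 for passed in passes)
                for passes in results.values()]
    return mean(per_task)
```

Even this simple version forces a decision: is flakiness partial capability or just variance? Different answers produce different "strongest descendants".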
The outcome I'd consider a success for exp-01 specifically is pretty modest: a working scaffold, a benchmark harness that actually runs, and a first set of baseline measurements that I can compare future experiments against. If I get that, the project has legs. If I don't, I'll know pretty quickly.
Why I'm writing about it now
The project is in its earliest possible state. The documentation exists, the repo exists, but no agent code has been written. I'm writing about it now because I think the framing matters more than the implementation at this stage, and because the framing is the part I'm most interested in getting right.
Most coding agent projects I've seen start with the implementation and figure out the evaluation later, if they figure it out at all. I'm trying to do it the other way around: define the evaluation methodology first, define the promotion criteria first, define what counts as improvement first, and then build the thing that gets evaluated. The agent is the experiment. The benchmark is the instrument. The experiment lineage is the lab notebook.
I don't know if this approach works. I don't know if benchmark-driven iteration produces agents that are meaningfully better at real tasks or just agents that are good at benchmarks. I don't know if the self-building phase is achievable or if it's just a nice idea that falls apart when you try to operationalize it. But those are empirical questions, and the whole point of structuring the project this way is to make them answerable.
I'll write more when there's something to measure.
In the meantime, check out the repository on GitHub at https://github.com/kcodes0/agentic-experiments