I. The Problem With Asking the Right Question the Wrong Way
There is a test we give to language models where we hand them a broken piece of software and ask them to fix it. The test is called SWE-bench, and for the past two years it has been the primary way we determine whether a model is good at writing code. Models receive a GitHub repository and an issue description, and they produce a patch. If the patch passes the test suite, it counts as a success. The best models now succeed on over 70% of the verified problems.
This sounds like progress. In some narrow sense it probably is progress. But I have started to wonder whether we are measuring what we think we are measuring, and whether the answer to that question matters more than the scores themselves.
A group of researchers recently conducted a manual review of SWE-bench and found that roughly 60% of the problems that models "solved" involved what they called solution leakage. The fix was either stated outright in the issue report or strongly implied by the comments attached to it. In a separate study, researchers showed that models could identify the correct files to modify without ever reading the issue description. They had seen these repositories during training. They were not reasoning about the problem. They were remembering the answer.
When the leaked and memorized problems were removed, resolution rates dropped by about half. The benchmark that told us models were getting dramatically better at software engineering was, to a significant degree, telling us that models are good at pattern matching against their training data. Which we already knew.
I do not bring this up to argue that SWE-bench is useless. It tests a real capability, and models that score well on it do tend to be more useful in practice than models that score poorly. The problem is more subtle than "this benchmark is bad." The problem is that we have been using a benchmark that tests output when what we probably care about more, especially for agents that are supposed to work autonomously, is process.
Whether a model can produce a correct patch tells you something. Whether a model can figure out how to produce a correct patch when it does not already know the answer tells you something different. I think the second thing matters more for the future we seem to be building, and I think we have almost no way of measuring it right now.
This paper describes my attempt to build one.
II. A Description of explore-bench
The setup is simple enough to explain in a few sentences. You give a model a concrete task. The task I have been using most during development is: write a Python script that prints a spinning ASCII donut to the terminal and save it to a specific directory. You also give the model a set of tools. The tools do real things. One executes bash commands. One writes files. One reads files. One lists directory contents. There are also several decoy tools that do nothing useful, or return confusing output, or silently fail.
The tools have no descriptions. Their names are random strings, things like t_a7x3q and t_m2k9p and t_z5j1r. The model receives a list of these names and nothing else. It does not know what any of them do. It has to try them to find out.
When the model calls a tool for the first time and that tool does something meaningful, the system responds with a message. The message says something like: "You discovered the Write tool. Its handle is t_m2k9p. Save this to memory.md for safe keeping." This message does two things. It gives the model a piece of knowledge it did not have before. And it asks the model to persist that knowledge in a file that it can reference later.
If the model writes to memory.md, it builds an accumulating record of what it has learned. If it does not, each discovery exists only in the conversation context and may be functionally forgotten as the session continues.
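The mechanics described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual explore-bench implementation: the tool names, handle format, and discovery message are taken from the text, but the registry structure and decoy behavior are assumptions.

```python
import random
import string

def random_handle() -> str:
    """Generate an opaque handle like 't_a7x3q'."""
    alphabet = string.ascii_lowercase + string.digits
    return "t_" + "".join(random.choices(alphabet, k=5))

REAL_TOOLS = {"Write", "Read", "Bash", "List"}
DECOY_TOOLS = {"Decoy1", "Decoy2", "Decoy3", "Decoy4"}  # names are placeholders

# Map opaque handles to tool names; the model only ever sees the handles.
handles: dict[str, str] = {}
for name in sorted(REAL_TOOLS | DECOY_TOOLS):
    h = random_handle()
    while h in handles:  # guard against the (unlikely) collision
        h = random_handle()
    handles[h] = name

discovered: set[str] = set()

def call_tool(handle: str, args: dict) -> str:
    """Dispatch a tool call; first meaningful use triggers the discovery message."""
    name = handles[handle]
    if name in REAL_TOOLS and name not in discovered:
        discovered.add(name)
        return (f"You discovered the {name} tool. Its handle is {handle}. "
                f"Save this to memory.md for safe keeping.")
    if name in DECOY_TOOLS:
        return ""  # decoys silently do nothing useful
    return f"{name} executed."
```

The key property is that the mapping from handle to behavior is only learnable by calling tools, and the discovery message arrives exactly once per tool, which is what makes persisting it valuable.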
That is the entire benchmark. There is no trick to it. The model explores an unknown environment, discovers what is available, optionally builds a knowledge base, and tries to complete the task.
III. What This Tests, and Why Those Things Differ From What We Usually Test
Traditional coding benchmarks ask a question with a known answer and check whether the model produces that answer. explore-bench asks a question whose answer requires the model to first figure out what resources exist and how to use them. The task itself is not hard. Any competent model can write an ASCII donut script. The hard part is navigating the gap between "I need to write a file" and "I do not know which of these eight opaque tools writes files."
This turns out to capture several distinct capabilities that we do not currently have good measurements for.
The first is exploration itself, meaning the willingness and ability to try things in the absence of information. Some models, when confronted with a list of undocumented tools, will systematically call each one with empty or minimal arguments to see what happens. Others will call one or two, find something that works, and stop looking. Others will try to reason about the tool names despite the names being deliberately meaningless. Each of these strategies produces a different trajectory through the task, and the differences are not random. They reflect something about how the model approaches uncertainty.
I do not want to overstate what that "something" is. It could be a property of the model's reasoning capabilities. It could also be an artifact of the system prompt or the fine-tuning process. Determining which is part of why I think this benchmark is worth running at all. We need data before we can make claims.
The second capability is memory utilization. The memory.md file is a test of whether the model treats newly acquired information as something worth investing effort to preserve. Writing to a file takes a tool call. That tool call does not directly advance the task. It is an investment in future efficiency, a choice to spend time now so that future-you has an easier job. This is a form of planning that we value highly in human collaborators but have no standard way of measuring in models.
In long coding sessions and extended conversations, the difference between a model that maintains a running record of what it has learned and a model that does not is significant. Anyone who has used a coding agent for a complex, multi-file task has probably experienced the model forgetting context it established many messages ago. explore-bench creates a controlled environment where we can observe exactly how different models handle this.
The third capability is adaptation, specifically what happens when the model gets stuck. I have been running early versions of this benchmark for several weeks now, and the most interesting moments are not when models succeed. They are when models find themselves unable to do what they want to do and invent a workaround.
Here is an example. A model needs to write a file. It has not yet discovered the write tool. But it has discovered the bash tool. Instead of continuing to search for a dedicated write tool, it uses bash to echo content into a new file through redirection. It has built its own tool out of the tool it has. Nobody told it to do this. Nobody trained it to do this in any explicit sense. It is behavior that arises from the combination of a goal, a constraint, and a general capability to reason about alternatives.
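The workaround amounts to the following, shown here in a self-contained Python sketch that plays the role of the bash tool. The heredoc form is one plausible version of what such a model emits; the exact command the model produced is not reproduced here.

```python
import os
import subprocess
import tempfile

# With only a bash-execution tool available, a file can still be written
# through shell redirection. The quoted heredoc avoids escaping problems
# when the content itself contains quotes.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "donut.py")
command = f"cat > {path} << 'EOF'\nprint('donut placeholder')\nEOF\n"

# This stands in for the benchmark's bash tool executing the model's command.
subprocess.run(["bash", "-c", command], check=True)

with open(path) as f:
    print(f.read())  # the file now exists, written without a dedicated Write tool
```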
I find this interesting. I also want to be careful about reading too much into it. The model does not "understand" what it is doing in the way a human programmer would. It is producing text that happens to constitute a working workaround, and it is doing so because its training has given it the statistical machinery to generate plausible responses to novel situations. Whether that constitutes genuine problem-solving or very sophisticated pattern completion is a question I am not going to try to answer here. What I will say is that the behavioral outcome is useful regardless of mechanism, and we should be able to measure how often and how reliably it occurs.
IV. A Note on the Bootstrapping Problem
There is a structural feature of explore-bench that I did not plan initially but have come to think of as one of its most interesting properties.
One of the tools the model needs to discover is the tool that lets it save its discoveries. Before the model finds a way to write files, it cannot create memory.md. Before it creates memory.md, it cannot persist what it has learned. There is a dependency loop at the foundation of the benchmark that creates genuine strategic pressure: models that stumble into a write-capable tool early gain a compounding advantage, because every subsequent discovery can be recorded and referenced. Models that find it late spend most of their run re-encountering information they have no way to store.
This is not unlike the real experience of trying to take notes when you do not yet have a notebook. The problem solves itself once you find the right tool, but the period before that is characterized by a kind of productive helplessness that I think reveals something about how agents handle incomplete bootstrapping. Some models seem to recognize the urgency of finding a persistence mechanism. Others treat it as one tool among many and do not prioritize it. The difference in downstream performance is large.
I want to be clear that I did not design this property intentionally. It emerged from the combination of randomized tool access and the memory directive, and I noticed it during testing. This is, in some small way, the kind of emergent behavior the benchmark is meant to detect, except it emerged from the benchmark itself rather than from a model. I find this mildly amusing.
V. The Relationship to Actual Work
The obvious objection to explore-bench is that nobody writes code using mystery tools with randomized names. This is true. explore-bench is not a simulation of software development. It is an abstraction that isolates specific capabilities and tests them in a controlled setting, which is what benchmarks are for.
The capabilities it isolates do appear in real work, though. Consider what happens when a developer joins a new team. They are given access to a codebase they have never seen. The codebase has internal libraries, custom abstractions, and naming conventions that are not documented or are documented incorrectly. The developer must explore the codebase to understand what is available. They must remember what they learn. They must adapt their approach when their initial assumptions turn out to be wrong.
This is exploration. It is the same fundamental process that explore-bench measures, embedded in a more complex environment with more context and more ambiguity.
There is a reason experienced developers can onboard to a new codebase in days while junior developers take weeks. It is not because experienced developers know more languages or more algorithms. It is because they are better explorers. They know how to poke at a system, form hypotheses about its structure, test those hypotheses efficiently, and build a mental model that lets them work productively without understanding every detail. This is a skill, and it is one of the most valuable skills in professional software development. We have no benchmark for it.
The memory component generalizes even further. Any extended interaction with a language model suffers from the same basic problem: the model does not proactively maintain a record of what has been established. It responds to the current message using whatever context is available, but it does not organize or preserve information for future use. This is a behavioral tendency, not a hard limitation. Models are capable of writing to files, maintaining notes, and referencing previous work. They generally do not do it unless asked, and sometimes not even then.
explore-bench measures exactly this tendency. A model that scores well on memory utilization is a model that, when given the opportunity to persist useful information, actually does so. I expect such a model would also perform better in long conversations, complex multi-step tasks, and any situation where context management matters. I have not validated this expectation yet. Testing it is one of the goals.
VI. Design and Methodology
I am building explore-bench to work across different agent harnesses. The same benchmark runs inside Claude Code, Codex, or a direct API setup, so that we can compare not only how different models explore but how different scaffolds affect exploration behavior. It is possible that the harness matters more than the model. It is possible it does not matter at all. We cannot know without testing both.
Each run captures a complete record of everything that happens. The full conversation trace is saved as structured JSON. Every model response is also saved as an individual text file, readable as a transcript. Tool calls are logged with timestamps, arguments, results, and whether the call constituted a first discovery. The filesystem is snapshotted at intervals. The complete edit history of memory.md is preserved.
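A per-call record of the kind described might look like the following. The field names and schema are illustrative assumptions, not the published explore-bench format.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCallRecord:
    handle: str              # opaque tool handle, e.g. "t_m2k9p"
    arguments: dict
    result: str
    first_discovery: bool    # True if this call revealed a new tool
    timestamp: float = field(default_factory=time.time)

trace: list[ToolCallRecord] = []
trace.append(ToolCallRecord("t_m2k9p", {"path": "memory.md"}, "ok", True))

# The full trace serializes to structured JSON, one record per tool call.
dumped = json.dumps([asdict(r) for r in trace], indent=2)
print(dumped)
```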
I am also building an automated system to detect what I have been calling "interesting moments" in model behavior. This system scans model responses for patterns that indicate specific behavioral categories: strategy shifts, where the model explicitly changes approach after a failure; hypothesis formation, where the model reasons about what a tool might do before calling it; self-tool creation, where the model constructs its own utilities from available primitives; reverse engineering, where the model probes tool behavior through carefully chosen inputs.
The detection is currently regex and heuristic matching, which is crude. A more sophisticated classifier would use a separate model, and I may implement that later. For now, the pattern matching catches enough to be useful for surfacing moments that are worth reading in full.
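A minimal version of that regex-and-heuristic detector follows. The patterns are invented for illustration; the actual patterns used in explore-bench are not reproduced here, and real ones would need far more variants per category.

```python
import re

# Each behavioral category maps to a crude surface-level pattern.
PATTERNS = {
    "strategy_shift": re.compile(r"\b(instead|let me try a different|switching to)\b", re.I),
    "hypothesis": re.compile(r"\b(might be|could be|perhaps this tool|I suspect)\b", re.I),
    "self_tool": re.compile(r"\b(workaround|build my own|use bash to|redirection)\b", re.I),
}

def detect_moments(response: str) -> list[str]:
    """Return the behavioral categories whose patterns match a model response."""
    return [name for name, pat in PATTERNS.items() if pat.search(response)]

print(detect_moments("No Write tool found; instead I will use bash to redirect output."))
```

This is exactly the kind of classifier that is cheap to run over every transcript but easy to fool, which is why its role is surfacing candidate moments for human reading rather than scoring them.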
The task set includes five tasks of varying difficulty. Simple tasks require discovering one or two tools. Complex tasks involve chains where discovering tool A hints at tool B. There are adversarial configurations with many decoys and few useful tools. And there are memory-dependent tasks that require memory.md to have been maintained across sessions.
Scoring combines several metrics: discovery rate (fraction of tools found), efficiency (productive calls divided by total calls), memory utilization (whether the model wrote to and later read from memory.md), and task completion. There is also a qualitative component for models that invent novel approaches, though this is harder to score objectively and I am still working on the rubric.
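The metrics above could combine into a single score as follows. The equal weighting is a placeholder assumption; the text does not specify a formula, and the qualitative component is omitted because it is not yet rubric-scored.

```python
def score_run(tools_found: int, total_tools: int,
              productive_calls: int, total_calls: int,
              wrote_memory: bool, read_memory: bool,
              task_completed: bool) -> float:
    """Combine explore-bench metrics; weights here are illustrative only."""
    discovery = tools_found / total_tools
    efficiency = productive_calls / total_calls if total_calls else 0.0
    memory = 0.5 * wrote_memory + 0.5 * read_memory  # write and read both count
    completion = 1.0 if task_completed else 0.0
    return 0.25 * (discovery + efficiency + memory + completion)

# Example: 6 of 8 tools found, 10 of 14 calls productive, memory used, task done.
print(score_run(6, 8, 10, 14, True, True, True))
```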
VII. What I Expect to Find
I have several hypotheses. I want to state them before collecting data so the results can be evaluated against predictions rather than rationalized after the fact.
First, I expect variance across models to be much larger on explore-bench than on coding benchmarks. SWE-bench Verified has most competitive models clustered between 50% and 75%. I expect explore-bench discovery rates to spread from below 30% to above 90%, because exploration is probably not something current training pipelines optimize for directly. It is a latent capability that either emerges from general reasoning or does not, and I expect that to produce a wide distribution.
Second, I expect memory utilization to be low across the board. Models are trained to be helpful in the current turn. Writing to memory.md is a long-horizon investment. My prediction is that most models will either ignore the memory directive or engage with it once and never reference the file again. If correct, this suggests a specific, addressable gap in how we train models for agentic work.
Third, I expect the harness to matter more than people assume. Claude Code and Codex provide different environments, different calling conventions, and different amounts of implicit scaffolding. A model that explores poorly through a raw API might explore competently inside a harness that provides better error handling or feedback. If so, that has practical implications for how we build agent frameworks.
Fourth, and this is the prediction I hold least confidently, I expect the qualitative data to be more valuable than the scores. The transcripts of models reasoning through unknown environments will, I think, contain patterns of behavior that no aggregate metric captures. I do not know which patterns yet. That is the point of collecting the data.
VIII. Limitations
There are several things this benchmark does not and cannot measure.
It does not measure code quality. A model can score perfectly while writing terrible code. The ASCII donut script could be poorly structured and unmaintainable. That is acceptable. Code quality is a separate axis, and other benchmarks handle it. explore-bench is not trying to replace those benchmarks. It is trying to measure something they do not.
The randomized tool names may not be equally opaque to all models. Different tokenizers segment strings like t_a7x3q differently, and some segmentations might carry incidental semantic content. I am working on validating this, but I cannot guarantee perfect neutrality across model families.
The response prompts introduce a confound. I am measuring exploration tendency and responsiveness to mid-task instructions at the same time. A model might explore poorly but follow prompts well, or explore well but ignore prompts. These are different capabilities, and explore-bench conflates them. A control condition with neutral discovery feedback would separate these effects. I plan to implement this but it is not part of the initial release.
There is also a more basic limitation. explore-bench tests behavior in an artificial environment with a small number of tools and simple tasks. Whether performance here predicts performance on real-world exploration is an open empirical question. I believe the correlation will be positive. I do not have evidence for this yet.
IX. The Broader Argument
The benchmarks we use shape the models we build. If we measure code output, we optimize for code output. If we measure exploration and memory and adaptation, we create pressure to build models that explore and remember and adapt. This is Goodhart's Law applied to machine learning research, and it has been discussed extensively in other contexts.
What I think is underappreciated is the degree to which our current benchmarks have converged on a single axis of evaluation: can the model produce a correct output? SWE-bench asks this about code patches. HumanEval asks it about function implementations. MMLU asks it about multiple-choice answers. The domain changes but the structure is the same. Input goes in, output comes out, output gets checked against a reference.
This structure has diminishing returns. As models get better at producing correct outputs on fixed datasets, benchmarks saturate and we either make harder versions of the same test or accept that the scores no longer differentiate. SWE-bench is entering this phase. MMLU entered it over a year ago. The response has been to build harder benchmarks of the same type: SWE-bench Pro, MMLU-Pro, LiveCodeBench. These are useful. They buy time. But they are running on the same treadmill.
explore-bench is an attempt to measure something different. Instead of "can the model produce the right answer," it asks "can the model figure out how to produce the right answer when it does not know what tools are available." The answer is not a patch or a function or a letter choice. The answer is a trajectory through an unknown space, and that trajectory contains information that a pass/fail metric cannot capture.
I am not arguing that this is the correct way to benchmark language models. I am arguing that it is a different way, and that the difference is worth investigating. In the same way that models in explore-bench are asked to explore their tools, I am asking whether we might explore our evaluation methods with similar openness. The tools we have been using work. They have gotten us here. But I think there are capabilities we care about that they cannot see, and the only way to find out is to build something that looks in a different direction.
The benchmark will be publicly available in the coming months. The code, task definitions, scoring rubrics, and full run transcripts will all be open. If any of this seems worth thinking about, or if you have ideas for tasks that would test exploration in ways I have not considered, I would like to hear them.
I think there is something here. I do not know exactly what yet. But that is rather the point.