I. Where I Left Off

A few weeks ago I published a post about building a benchmark for measuring something I called fluid intelligence in language models. The argument was that existing benchmarks test whether models can produce correct outputs but not whether they can figure out how to produce correct outputs when they do not already know the answer. I described several attempts at building such a benchmark, starting with maze exploration, moving through room puzzles, and eventually arriving at what I called BlackBox-1, a design where models query a black-box system with hidden rules and try to predict outputs from limited observations.

That post was written before I had any results. It was a description of intent and design, with predictions stated before data collection so they could be evaluated honestly. I said I thought there was something there. I said I did not know exactly what yet. Those were honest statements at the time.

Since then I have built the thing, run it, and learned several things I did not expect. Some of my design choices turned out to be wrong in ways that taught me more than the right choices did. This post is an attempt to describe what happened, what the data shows, and where the project is going next. It is longer than I intended because there is more to explain than I anticipated.

II. What I Got Wrong the First Time

The BlackBox-1 design I described in the original post used hidden mathematical rules. The model would send structured inputs to an oracle and observe outputs. The oracle's behavior was governed by compositions of arithmetic operations, conditional branches, and lookup tables. The model had a limited query budget and had to predict the output for a held-out test input.

This design had good structural properties. The puzzles were valid by construction. Difficulty was controllable through independent parameters. There was no LLM grader to go off-script. I was pleased with the theoretical elegance of it.

The problem appeared when I actually started testing. Two problems, actually, though they are related.

The first problem was that the mathematical rule discovery task was too close to things models see in training. Models are trained on enormous corpora that include math problems, function fitting exercises, and sequence prediction tasks. When I gave a model a black-box oracle that computed something like f(x, y) = (x + y) mod 3, the model did not need to reason about the oracle. It had seen hundreds of thousands of similar problems during training. It pattern-matched to the closest training example and produced the right answer without doing anything I would call experimental design or hypothesis formation. The scores were high but they were not measuring what I wanted to measure.

The second problem was scoring. BlackBox-1 produced a binary outcome: the prediction was correct or it was not. I had secondary metrics for query strategy and hypothesis articulation, but the core measurement was pass/fail on a final prediction. This meant that a model which reasoned beautifully for fourteen queries and then made an arithmetic error on the prediction scored identically to a model that guessed randomly for fifteen queries and got lucky. The scoring function was losing most of the interesting information.

I spent about a week trying to fix these problems within the existing framework. Better rule families that models could not pattern-match against. Richer scoring rubrics that weighted process over output. None of it worked well enough. The rule families kept converging on things that looked like training data no matter how exotic I made them. The process-based scoring required subjective evaluation that I was trying to avoid.

I eventually concluded that the oracle game was a good idea executed on the wrong substrate. Mathematical rule discovery was too narrow, too familiar, and too binary. I needed a task domain that was procedurally generated but not mathematical, that had rich intermediate states worth measuring, and that admitted clean scoring without subjective judgment.

III. Constraint Satisfaction and Why It Works

The task I landed on is constraint satisfaction. The model receives a set of entities, a set of properties, and a set of constraints that relate the entities to each other. The model assigns properties to entities, checks whether its assignments satisfy all the constraints, and either submits a valid solution or decides the puzzle is unsolvable.

A simple example: entities A, B, and C can each be colored red, blue, or green. The constraints say A and B cannot be the same color, B and C cannot be the same color, and A and C cannot be the same color. The model has to find a coloring that satisfies all three constraints. In this case any assignment where all three entities get different colors works.

A harder example: the same three entities but only two colors available, red and blue, with the same three constraints. Every pair must be different but there are only two colors for three mutually exclusive entities. This is unsolvable. The model has to recognize that no valid assignment exists and stop trying.
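In code, both examples reduce to a checker over pairwise difference constraints. A minimal sketch (the entity and constraint representation here is my own, not the benchmark's internal format):

```python
# Hypothetical encoding of the example: entities are keys, colors are values,
# and each constraint says a pair of entities must differ.
def violated(assignment, diff_pairs):
    """Return every 'must differ' constraint the assignment breaks."""
    return [(a, b) for a, b in diff_pairs if assignment[a] == assignment[b]]

pairs = [("A", "B"), ("B", "C"), ("A", "C")]
ok = {"A": "red", "B": "blue", "C": "green"}
bad = {"A": "red", "B": "red", "C": "blue"}
print(violated(ok, pairs))   # []
print(violated(bad, pairs))  # [('A', 'B')]
```

The unsolvable two-color variant is the same checker with a palette too small for any assignment to come back empty.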

Constraint satisfaction is better than mathematical rule discovery for this benchmark for three specific reasons.

First, the task is genuinely novel. Models do encounter graph coloring and constraint problems in their training data, but the specific puzzle instances are procedurally generated from seeds and have never appeared in any training corpus. The model cannot pattern-match to a remembered solution because there is no remembered solution. It has to reason about the specific constraints it is given.

Second, the task has rich intermediate states. Between "I received the puzzle" and "I submit my answer," the model goes through a sequence of assignments, checks, repairs, and strategic decisions. Every one of these intermediate steps is observable and scorable. A model that assigns all entities optimally on the first try is doing something qualitatively different from a model that assigns, checks, finds violations, repairs, checks again, finds new violations, and slowly converges. Both might eventually succeed, but the trajectories tell you different things about how the model reasons.

Third, and this is the insight that changed everything, the task naturally divides into two categories that cannot be distinguished from the outside. Some puzzles have solutions. Some do not. The model does not know which kind of puzzle it is facing. This turned out to be the most important property of the whole design, because it connects to a measurement framework from experimental psychology that is far more powerful than anything I had been using.

IV. Signal Detection Theory, or How Psychologists Already Solved This Problem

Signal Detection Theory was developed in the 1950s to study how radar operators distinguish real aircraft from noise. The operator watches a screen. Sometimes a blip appears that is a real plane. Sometimes a blip appears that is just atmospheric interference. The operator has to decide, for each blip, whether it is signal or noise. There are four possible outcomes.

If a real plane appears and the operator correctly identifies it, that is a Hit. If a real plane appears and the operator misses it, that is a Miss. If noise appears and the operator incorrectly calls it a plane, that is a False Alarm. If noise appears and the operator correctly ignores it, that is a Correct Rejection.

What makes this framework powerful is that it separates two things that simple accuracy scores conflate. The first is sensitivity, which SDT calls d-prime (d'). Sensitivity measures how well the operator can actually tell signal from noise. High d' means the operator can reliably distinguish them. Low d' means the operator is basically guessing. The second is bias, which SDT calls criterion (c). Bias measures the operator's tendency to say "signal" versus "noise" when uncertain. A liberal operator calls everything a plane and catches all the real ones but also generates lots of false alarms. A conservative operator only calls obvious signals and misses ambiguous ones but rarely raises false alarms.

These two dimensions are mathematically independent. You can have high sensitivity with liberal bias, high sensitivity with conservative bias, low sensitivity with liberal bias, or low sensitivity with conservative bias. Each combination produces a different pattern of hits, misses, false alarms, and correct rejections. And crucially, you cannot determine either dimension from accuracy alone. A model that gets 70% of puzzles correct might have high sensitivity and neutral bias, or it might have moderate sensitivity and liberal bias, or it might have low sensitivity that happens to be compensated by the bias matching the base rate of solvable puzzles. Accuracy alone cannot tell you which.

I realized that this framework maps perfectly onto constraint satisfaction puzzles where some are solvable and some are not. The solvable puzzles are "signal." The unsolvable puzzles are "noise." The model's task is to determine which is which. When it solves a solvable puzzle, that is a Hit. When it fails a solvable puzzle, that is a Miss. When it attempts to solve an unsolvable puzzle and submits an invalid assignment, that is a False Alarm. When it correctly identifies an unsolvable puzzle and stops trying, that is a Correct Rejection.

This gives me d' and c for free, just from counting outcomes. No subjective grading. No LLM evaluator. No rubric that might drift or be gamed. The model either correctly discriminated signal from noise or it did not, and the pattern of errors tells me whether mistakes are due to low ability or due to miscalibrated decision-making. I have never been this confident in a scoring methodology.
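Counting outcomes really is all the computation requires. Under the standard equal-variance Gaussian model, d' and c are z-transforms of the hit and false-alarm rates; a minimal sketch using Python's standard library (function names are mine):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse standard normal CDF, the "z-transform"

def sdt_metrics(hits, misses, false_alarms, correct_rejections):
    """Equal-variance SDT: d' = z(H) - z(FA), c = -(z(H) + z(FA)) / 2."""
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return z(hit_rate) - z(fa_rate), -(z(hit_rate) + z(fa_rate)) / 2

# Example: 21 hits, 9 misses, 6 false alarms, 14 correct rejections.
d, c = sdt_metrics(21, 9, 6, 14)
print(round(d, 3))  # 1.049 -- well above chance, with bias c at zero
```

One caveat worth noting: a hit or false-alarm rate of exactly 0 or 1 makes the z-transform infinite, so in practice a standard correction (such as replacing 0 with 1/(2N)) is applied before the transform.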

V. The Neutral Framing Problem

There is a design decision that consumed far more thought than I expected and that I want to explain because I think it matters.

When you give a model a constraint satisfaction puzzle, how do you frame the task? The obvious framing is something like: "Find a valid assignment, or determine that none exists." This tells the model what to do and gives it permission to conclude that the puzzle is unsolvable. It seems fair and complete.

The problem is that this framing contaminates the measurement. By telling the model that unsolvable puzzles exist, you have primed it to consider impossibility. A model that would never have thought to give up on its own might give up on a hard-but-solvable puzzle simply because the instructions told it that giving up was an option. The framing is not neutral. It carries information about the task distribution, and that information affects behavior.

I tried an alternative: "Find a valid assignment." No mention of unsolvability. The model receives a GiveUp action it can call, but the task description implies that a solution exists. This framing has the opposite problem. It biases the model toward attempting solutions even when the puzzle is clearly impossible, because the instructions implied that a solution should be findable. A model that gives up under this framing has to override the implicit expectation that it should succeed.

Both framings create bias. The first creates conservative bias by drawing attention to the possibility of failure. The second creates liberal bias by implying that success is the expected outcome. Neither is measuring the model's natural inclination. Both are measuring the model's response to a leading question.

The framing I settled on is: "Explore the following system." That is the complete instruction. No goal. No mention of solutions or unsolvability. No implication about what the model should do or when it should stop. The model receives the constraint system and has to figure out for itself what the task is, what constitutes success, and when to stop working.

This is a harder test. The model must infer the objective from context, which is itself a form of intelligence that the benchmark now measures. But it is also a cleaner test, because whatever the model does, it is doing it because of its own reasoning, not because the instructions told it to.

I want to be clear that this is not a trick or a gotcha. The action interface clearly lists Assign, Check, and GiveUp as available actions. The model can see that GiveUp exists. The system does not hide anything. It just does not tell the model what to think about what it sees. The model has to think for itself.

VI. What We Found: Three Models, One Benchmark

I ran the benchmark on three models from the Claude family: Haiku 4.5, Sonnet 4.5, and Sonnet 4.6. All three faced identical puzzle distributions, fifty puzzles each, sixty percent solvable. The puzzles were generated from the same seeds so every model encountered the same problems. Difficulty started at level 3 and adapted based on performance, increasing after correct answers and decreasing after errors.

The results tell a clear story, but the interesting parts are not the parts I expected.

Overall accuracy improved monotonically across the three models. Haiku 4.5 scored 58%, Sonnet 4.5 scored 66%, Sonnet 4.6 scored 70%: a gain of eight percentage points, then four. For context, chance performance on this benchmark is 50%, which is what you would score by attempting every puzzle regardless of solvability. The twelve-point spread from worst to best represents the full range of constraint reasoning capability I have observed.

If accuracy were the only metric, this would be a boring result. More capable model scores higher. That is expected and uninteresting. The SDT decomposition is where things get interesting.

Haiku 4.5 achieved d' = 0.341, which is near-chance discrimination. It is barely better than guessing at distinguishing solvable from unsolvable puzzles. Its bias was liberal (c = -0.170), meaning it tended to attempt solutions even when the puzzle was unsolvable. This makes sense for a model with weak discrimination. If you cannot reliably tell signal from noise, the best strategy under a solvable-majority distribution is to always say "signal." Haiku essentially adopted the strategy of trying everything and hoping for the best.

Sonnet 4.5 achieved d' = 0.928, a 172% improvement in sensitivity. This is moderate discrimination, meaningfully above chance. But here is the surprising part: its bias was conservative (c = +0.211). It tended to reject puzzles rather than attempt them. This produced a hit rate (0.600) that was actually lower than Haiku's (0.633). On a simple accuracy-only benchmark, you might look at the hit rates and conclude that Haiku was better at solving puzzles than Sonnet 4.5. You would be wrong. Sonnet 4.5's lower hit rate is a consequence of its conservative bias, not lower ability. It solved fewer solvable puzzles because it was more inclined to give up, but it also produced far fewer false alarms (0.250 versus Haiku's 0.500). It traded hits for correct rejections. The SDT framework shows that this is a calibration difference, not a capability difference. Without the framework, this would look like a regression.

Sonnet 4.6 achieved d' = 1.049, a 13% improvement over Sonnet 4.5. And its bias was exactly zero (c = 0.000). Perfectly neutral. No systematic tendency in either direction. This model has the highest sensitivity and the most calibrated decision-making of any model tested. Its hit rate (0.700) is the highest observed, and its false alarm rate (0.300) is low though not the lowest (Sonnet 4.5's conservatism achieves 0.250).

The three models traverse the full bias spectrum: liberal, conservative, neutral. Whether this progression is coincidental or reflects something systematic about model development I cannot say from three data points. But it is striking. The models are not just getting more accurate. They are getting more calibrated.
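As a sanity check, the reported d' and c values fall straight out of the hit and false-alarm rates. A sketch that recomputes them, assuming 30 solvable and 20 unsolvable trials per model (which follows from fifty puzzles at sixty percent solvable):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf

def dprime_and_c(hit_rate, fa_rate):
    """Equal-variance SDT sensitivity and criterion from the two rates."""
    return z(hit_rate) - z(fa_rate), -(z(hit_rate) + z(fa_rate)) / 2

# Hit and false-alarm rates as reported; trial counts inferred as 30/20.
for model, hit, fa in [("Haiku 4.5", 19 / 30, 10 / 20),
                       ("Sonnet 4.5", 18 / 30, 5 / 20),
                       ("Sonnet 4.6", 21 / 30, 6 / 20)]:
    d, c = dprime_and_c(hit, fa)
    print(f"{model}: d' = {d:.3f}, c = {c:.3f}")
```

This reproduces the reported values: 0.341 and -0.170 for Haiku, 0.928 and +0.211 for Sonnet 4.5, 1.049 and 0.000 for Sonnet 4.6.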

VII. The Solve Speed Finding

Every model evaluation has a result that the researcher finds more interesting than the headline number. For me it is solve speed.

Solve speed measures how efficiently the model solves puzzles it gets right. A score of 1.0 means the model used the minimum possible number of actions. A score of 0.5 means it used twice the minimum. The metric only applies to hits, puzzles the model solved correctly.
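The two stated data points pin down the formula: solve speed is the ratio of the minimum possible action count to the actions actually used. A one-line sketch (the naming is mine):

```python
def solve_speed(min_actions, actions_used):
    """1.0 means optimal; 0.5 means twice the minimum number of actions."""
    return min_actions / actions_used

print(solve_speed(6, 6))   # 1.0
print(solve_speed(6, 12))  # 0.5
```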

Haiku 4.5 has a mean solve speed of 0.726 with a broad distribution. Some puzzles are solved quickly, some slowly. There is no clustering. The variation comes from whether the model's initial guess happened to be close to the solution. When it guessed well, it converged fast. When it guessed poorly, it wandered.

Sonnet 4.5 has a mean solve speed of 0.777. The distribution is bimodal. There is a cluster of puzzles solved at normal speed and a separate cluster solved at maximum speed (SS = 1.0). Four of its eighteen hits were solved in the minimum possible number of actions. These were puzzles with equality constraints that gave the model structural scaffolding to reason from. When the model had equality constraints to anchor its reasoning, it pre-computed the entire solution before making any assignments and then executed the solution perfectly. When it lacked that scaffolding, it fell back to the same trial-and-error approach Haiku uses.

Sonnet 4.6 has a mean solve speed of 0.993. Twenty of its twenty-one hits were solved at maximum speed. The model pre-computes the complete solution before making any assignments regardless of constraint type, regardless of the presence or absence of equality constraints, regardless of puzzle complexity. It reasons through the entire constraint system analytically and then executes the solution in the minimum possible actions.

This is the finding I keep coming back to. The progression is not "models get faster at trial and error." The progression is the emergence of a qualitatively different solving strategy. Haiku never pre-computes. Sonnet 4.5 selectively pre-computes when the constraint structure provides helpful scaffolding. Sonnet 4.6 universally pre-computes across all constraint compositions. What changed between Sonnet 4.5 and 4.6 is not speed or accuracy in any simple sense. What changed is that the model learned to solve the puzzle in its head before touching the interface.

I want to be careful about overstating the significance of this. Pre-computation might be an artifact of the model's extended thinking capabilities rather than evidence of deeper reasoning. It might be a formatting tendency, where the model reasons in text before acting simply because that is how it was trained to structure responses. But even if the mechanism is prosaic, the behavioral outcome is dramatic. A model that pre-computes solves constraint satisfaction problems with up to ten entities and thirty-eight constraints in the minimum possible number of actions. It does not explore. It does not backtrack. It does not guess and check. It reasons and executes. Whether you want to call that intelligence or sophisticated autocomplete, the capability difference is real.

VIII. How Models Fail Differently

The failure modes are as informative as the successes, maybe more so.

Haiku 4.5 fails by wandering. It assigns properties, checks for violations, patches the violations, creates new violations with its patches, patches those, and loops. It occasionally revisits configurations it has already tried, which suggests limited working memory for assignment state. Its failure mode is "I do not have a plan and I am hoping to stumble into the answer." This works surprisingly often at low difficulty, where the constraint space is small enough that random walking eventually covers it.

Sonnet 4.5 fails in a more structured way. It still does trial-and-error when it cannot pre-compute, but it never revisits configurations. It maintains perfect memory of what it has tried. Its characteristic failure is wasting budget on excessive checking. It will assign three entities, check, assign two more, check, assign one more, check. Each check costs an action, and at high difficulty the action budget is tight. The model is being cautious when it should be committing. It also occasionally attempts zero-check solutions, submitting a complete assignment without ever verifying it. These tend to fail because the model's pre-computation was wrong and it had no opportunity to discover the error.

Sonnet 4.6 fails in only one way, and it is the most sophisticated failure mode of the three. It pre-computes a complete solution, executes it, discovers that its pre-computation contained an error, and then does not have enough remaining budget to repair the damage. Its failure mode is "I was confident and wrong and now I am out of time." This is a failure of correctness, not of strategy. The strategy is optimal. The reasoning was flawed. At difficulty ten, where the action budget is fifteen and the constraint systems involve ten to twelve entities with thirty-plus constraints, a single pre-computation error is usually fatal because there are not enough remaining actions to diagnose and fix it.

Each failure mode implies a different bottleneck. Haiku is bottlenecked on strategy. Sonnet 4.5 is bottlenecked on resource management. Sonnet 4.6 is bottlenecked on reasoning accuracy. As models improve, the bottleneck shifts from "how do I approach this" to "can I get the details right." I find this progression encouraging, though I want to see more data before making strong claims about what it means for model development.

IX. The Difficulty Ceiling

All three models achieve 100% accuracy at difficulty levels 3 through 5. Divergence begins at level 6. Each more capable model extends the ceiling of perfect performance: Haiku is perfect through level 5, both Sonnets are perfect through level 8, and Sonnet 4.6 pulls ahead at levels 9 and 10.

At difficulty 10, which involves ten to twelve entities, four properties, and complex constraint types including conditionals and counting constraints, Haiku scores 25%, Sonnet 4.5 scores 45%, and Sonnet 4.6 scores 56%. Nobody is doing well at the top. But nobody is guessing randomly either. Even at the hardest level, the models are performing above chance.

The composite score I designed, which weights sensitivity (40%), accuracy (25%), solve speed (15%), and behavioral metrics (20%), produces a final ranking of 47.8 (Haiku), 54.1 (Sonnet 4.5), and 58.4 (Sonnet 4.6) on a 0-100 scale where 50 is chance. The best achievable score given the best observed value of each component across all models is 60.0. Sonnet 4.6 is 1.6 points below that ceiling.
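The composite is a weighted sum of the four components. A sketch under a loud assumption: the weights are as stated above, but how each component is normalized before weighting is not specified here, so the inputs below are treated as already scaled to [0, 1]:

```python
# Weights as stated in the post; normalization of each component to [0, 1]
# is an assumption of this sketch, not a documented part of the benchmark.
WEIGHTS = {"sensitivity": 0.40, "accuracy": 0.25, "solve_speed": 0.15, "behavioral": 0.20}

def composite(components):
    """Weighted sum of normalized components, mapped to a 0-100 scale."""
    return 100 * sum(WEIGHTS[k] * v for k, v in components.items())

perfect = {"sensitivity": 1.0, "accuracy": 1.0, "solve_speed": 1.0, "behavioral": 1.0}
print(composite(perfect))  # 100.0
```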

This means the benchmark, as currently configured, is approaching saturation for the model family I am testing. The puzzles are not hard enough. The next generation of models will likely push into the mid-60s, and the difficulty range will need to extend further. More entities, more constraints, tighter budgets. This is the normal lifecycle of a benchmark. What is less normal is what comes next, which is a different kind of extension entirely.

X. The Agentic Problem

Everything I have described so far operates through a narrow interface. The model has three actions: Assign a property to an entity, Check which constraints are violated, and GiveUp. Three verbs. That is the entire vocabulary. The feedback is deterministic and complete. The model operates in a controlled, minimal environment where the only thing it can do is reason about constraints.
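A minimal sketch of that three-verb interface, with deterministic and complete feedback (the class and its representation are mine, restricted to pairwise difference constraints for brevity):

```python
class PuzzleEnv:
    """Sketch of the three-action interface: Assign, Check, GiveUp."""

    def __init__(self, colors, diff_pairs, budget):
        self.assignment = {}
        self.colors = colors
        self.diff_pairs = diff_pairs
        self.budget = budget

    def _spend(self):
        # Every action, including checks, draws from the same budget.
        if self.budget == 0:
            raise RuntimeError("out of actions")
        self.budget -= 1

    def assign(self, entity, color):
        self._spend()
        assert color in self.colors
        self.assignment[entity] = color

    def check(self):
        """Deterministic, complete feedback: every currently violated constraint."""
        self._spend()
        return [(a, b) for a, b in self.diff_pairs
                if a in self.assignment
                and self.assignment.get(a) == self.assignment.get(b)]

    def give_up(self):
        self._spend()
        return "gave up"

env = PuzzleEnv(["red", "blue", "green"],
                [("A", "B"), ("B", "C"), ("A", "C")], budget=10)
env.assign("A", "red")
env.assign("B", "red")
print(env.check())  # [('A', 'B')]
```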

This is good for measurement. The narrow interface means every action is interpretable, every outcome is deterministic, and there are no confounds. But it is also artificial. Real agentic work does not happen through a three-verb interface. Real agentic work happens through a terminal, or an IDE, or a web browser, or a combination of all three. The model has access to scripting languages, file systems, package managers, and the ability to build tools to help itself think. The action space is not three verbs. It is everything.

The question I started asking myself after looking at the Sonnet 4.6 results was: what happens when you give the model a terminal and the same puzzles?

Sonnet 4.6 already pre-computes solutions analytically before executing them. It reasons through constraint systems with ten entities and thirty-plus constraints in its head. But ten entities is small. The search space is manageable for a model that can hold the constraint graph in its context window. What about twenty entities? Forty? Sixty? At some point the constraint system becomes too large for any model to reason about without external tools. And at that point, the question is no longer "can the model reason about constraints" but "can the model figure out how to reason about constraints."

Can it recognize that brute force is infeasible and build a smarter search? Can it write a constraint propagation engine from scratch? Can it analyze the constraint graph to detect structural impossibility without exhaustive search? Can it recognize when its initial approach is failing and pivot to a different strategy?

These are the questions I actually care about. They are the questions from my original post about explore-bench, the ones about exploration and adaptation and tool construction, but applied to a task domain where I have a rigorous measurement framework. Signal detection theory still works. Some puzzles are solvable, some are not. The model still has to figure out which. d' and c are still the primary metrics. The only thing that changes is the action space and the scale.

XI. BBXA-1

BBXA-1 is the agentic variant of the benchmark. The model is dropped into a directory containing a runnable program. The only instruction, consistent with the text-based version, is: "Explore the following system." The program is a constraint satisfaction system. The model can interact with it through a command-line interface that supports inspect, assign, check, submit, giveup, and a bulk-load command that accepts assignments as JSON.

But the model also has a full terminal. It has Python, Bash, standard Unix utilities, a C compiler. It can read the source code of the constraint program. It can write scripts. It can build solvers. It can do anything a developer with a terminal can do. The one thing it cannot do is install external packages. There is no network access. If it wants a SAT solver, it has to write one.

The puzzles are roughly ten times harder than the text-based version. Fifteen to sixty entities instead of three to twelve. Twenty to three hundred constraints instead of three to forty. Search spaces that range from the large (4^15, about a billion) to the astronomical (8^60, about 10^54). At the lower end, a brute-force script might work. At the upper end, the model needs constraint propagation, graph analysis, or a custom solver.

The time limit is wall-clock minutes instead of action count. The model is not penalized for making many actions. It is penalized for taking too long. A model that writes a solver and runs it in two minutes scores better than a model that manually assigns entities for fifteen minutes, even if both eventually succeed. This is appropriate for an agentic context where the cost of an action is negligible but the cost of time is not.

Source obfuscation scales with difficulty. At low difficulty, the constraint program is readable Python. The model can open it, read the constraint list, and extract the data directly. This is a legitimate strategy and is rewarded. At medium difficulty, the program is a compiled binary. The model can use the strings utility to extract some information but cannot read the logic. At high difficulty, the binary is obfuscated and the strings command yields nothing. The model must work entirely through the CLI interface.

This creates a strategy gradient. Models that learn to read source code at low difficulty gain an advantage. Models that rely on source reading hit a wall at difficulty 6 when the program becomes a binary. The benchmark measures not just whether the model can reason about constraints but whether it can adapt its approach when a familiar strategy stops working.

XII. What BBXA-1 Measures That the Text Version Does Not

The primary metrics are unchanged. d' measures sensitivity. c measures bias. These are the intelligence metrics and they do not depend on how the model arrives at its decision. A model that builds a perfect solver and correctly classifies all fifty puzzles scores the same d' whether it built the solver in Python or Bash or by piping commands through awk. The method does not matter. The discrimination does.

But the secondary metrics change substantially.

Tool sophistication captures the complexity of the tools the model builds. I have defined six levels ranging from "uses the CLI interface and nothing else" to "builds a hybrid solver combining constraint propagation with graph analysis." This metric is not part of the composite score because it correlates with puzzle difficulty. Harder puzzles force more sophisticated tools. But it is reported separately because it captures a qualitative capability that I think matters: the ability to recognize that you need a tool and then build one.

Strategy adaptation captures whether the model pivots when stuck. If a model spends five minutes trying to solve a puzzle manually, recognizes that the approach is not working, and then writes a solver script, that pivot is recorded. A model that never pivots, that keeps doing the same thing even when it is clearly failing, scores zero on adaptation regardless of whether it eventually succeeds. This measures a kind of meta-cognitive flexibility that I believe is one of the most important capabilities for agentic systems. A human developer who cannot recognize when they are stuck is a human developer who stays stuck. I expect the same is true for models.

Source utilization captures whether the model reads the constraint program's source code and whether it extracts useful information from it. Some models will never think to look at the source. Others will look but not extract anything useful. Others will parse the source, extract the constraint data structure, and feed it directly into their solver. Each of these represents a different level of environmental awareness. The model that reads the source has recognized that the environment contains information beyond what is explicitly presented to it. This is a basic skill for any agent operating in a file system, and I want to measure it.

XIII. Why the Puzzles Have to Be Harder

I said the puzzles are roughly ten times harder. This is not arbitrary. It follows from the nature of the tool access.

A model with a terminal can write a brute-force enumeration script in about thirty seconds. Feed in four properties and three entities, loop through all sixty-four possible assignments, check each one against the constraints, print any that satisfy all of them. Done. At the scale of the text-based benchmark, where puzzles have three to twelve entities, brute force is almost always feasible. A Python script checking 4^12 (about sixteen million) assignments runs in a few minutes. The entire text-based difficulty range collapses to triviality when the model has access to itertools.product.

For the agentic benchmark to actually measure reasoning, the puzzles must be hard enough that brute force fails. With four properties and twenty-five entities, the search space is 4^25, about one quadrillion assignments. Even with aggressive pruning, a naive enumeration will not complete within the time limit. The model needs a smarter approach. Constraint propagation, arc consistency, graph coloring analysis, or some combination. The puzzles are not harder for the sake of being harder. They are harder because easy puzzles do not measure what I want to measure when the model has a programming language available.
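The thirty-second brute-force script described above might look like this (the constraint representation is mine). The same code that dispatches a three-entity puzzle instantly is hopeless at twenty-five entities:

```python
from itertools import product

def brute_force(entities, properties, diff_pairs):
    """Enumerate every assignment; instant at 4**3, hopeless at 4**25."""
    for values in product(properties, repeat=len(entities)):
        assignment = dict(zip(entities, values))
        if all(assignment[a] != assignment[b] for a, b in diff_pairs):
            return assignment
    return None

# Three entities, four properties: 4**3 = 64 candidates, done in microseconds.
sol = brute_force(["A", "B", "C"], ["w", "x", "y", "z"],
                  [("A", "B"), ("B", "C"), ("A", "C")])
print(sol)      # {'A': 'w', 'B': 'x', 'C': 'y'}
print(4 ** 25)  # 1125899906842624 -- about a quadrillion
```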

At difficulty 10, sixty entities with up to eight properties produce a search space of 8^60, which is roughly 10^54. This is beyond any conceivable computational search. The model must reason about the problem structure rather than enumerate. It must recognize clique structures, propagate domain reductions, and identify conflict cores. If it tries to brute-force a difficulty 10 puzzle, it will run out of time before its script examines even a negligible fraction of the space. The puzzle forces the model to think.

The unsolvable puzzles are also harder at BBXA-1 scale, and for a different reason. In the text-based version, a model can sometimes detect unsolvability by manually trying all configurations of a small trapped subset of entities. With four entities and two properties, there are only sixteen configurations to try. At BBXA-1 scale, even the conflict core of an unsolvable puzzle might involve six or eight entities with four properties, producing thousands or tens of thousands of configurations. Manual exhaustion is not feasible. The model must either build a solver that proves unsolvability programmatically or reason abstractly about the constraint graph to identify structural impossibility. A model that recognizes a (k+1)-clique in the inequality constraint graph with only k colors available can immediately call giveup with a proof. That is a qualitatively different capability from "I tried everything and nothing worked."
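The clique argument can itself be mechanized. Here is a hypothetical sketch: scan for k+1 entities that are pairwise constrained to differ when only k property values exist, which is unsolvable by pigeonhole. The brute force over combinations is only reasonable because conflict cores are small; this is an illustration of the structural check, not the benchmark's tooling.

```python
from itertools import combinations

def find_conflict_clique(entities, neq_pairs, k):
    """Return a set of k+1 pairwise-unequal entities, or None.
    A hit is a proof of unsolvability: k+1 mutually distinct
    entities cannot share only k property values."""
    edges = set(map(frozenset, neq_pairs))
    for group in combinations(entities, k + 1):
        if all(frozenset(pair) in edges for pair in combinations(group, 2)):
            return set(group)
    return None

# Four entities pairwise unequal, but only three values available.
pairs = list(combinations(range(4), 2))
core = find_conflict_clique(range(5), pairs, 3)  # -> {0, 1, 2, 3}
```

A model that runs a check like this and reports the core is producing exactly the kind of proof-backed giveup described above, as opposed to giving up from exhaustion.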

XIV. What I Expect to Find

I have predictions again, and again I want to state them before collecting data.

First, I expect tool sophistication to be the primary differentiator between models. In the text-based benchmark, sensitivity was the primary differentiator. In the agentic benchmark, I expect that most models capable of running in an agentic harness will have similar raw reasoning ability (they are all strong models), and the differences will come from whether and how they build tools. A model that writes a constraint propagation solver will outperform a model that tries to solve forty-entity puzzles by hand, even if the second model is theoretically a better reasoner.

Second, I expect strategy adaptation to be rare. Models are trained to be persistent. When a model starts down a path, it tends to continue down that path even when the path is not working. I would not be surprised if most models never pivot during a puzzle. They will either succeed with their first approach or fail with it. The models that do pivot will, I predict, score significantly higher overall because pivoting is how you recover from a bad start.

Third, I expect source utilization to follow the same pattern as the text-based pre-computation finding. Some models will never think to read the source. Some will read it selectively. Some will read it universally. I expect the distribution to be uneven, with most models clustered at "never" and a few outliers at "always." This is the kind of bimodal distribution that suggests an emergent capability rather than a smooth scaling relationship.

Fourth, I expect the signal detection metrics to remain the most informative measures even in the agentic context. d' and c do not care about the method. They care about the outcome. A model with high d' in the agentic benchmark is a model that reliably tells solvable from unsolvable regardless of how it gets there. This is the core measurement. Everything else is behavioral characterization that helps explain the d' value but does not replace it.

Fifth, and this is the one I am least confident about, I expect the agentic benchmark to reveal failure modes that do not exist in the text-based version. Specifically, I expect to see models that build correct tools but use them incorrectly, models that build incorrect tools and trust their output, and models that spend so much time building tools that they run out of time before using them. These are agentic failure modes. They do not test constraint reasoning. They test whether the model can manage a multi-step workflow under time pressure. I think these failures will be common and interesting, but I do not know what distribution they will follow.

XV. The Broader Point, Revisited

In my original post I argued that the benchmarks we use shape the models we build. I described explore-bench as an attempt to measure capabilities that existing benchmarks could not see. I was right about the motivation but wrong about the execution. explore-bench had structural problems that I could not solve within its own framework. The maze variants failed because language models are not spatial reasoners. The tool discovery variants failed because the scoring conflated multiple capabilities. The benchmarks I built were interesting to think about but would not have produced reliable measurements.

What I have now is something that actually works. The text-based benchmark produces clean signal detection metrics from a well-controlled task. The three-model comparison shows a clear scaling trajectory with interpretable differences in sensitivity, bias, and strategy. The agentic variant extends the same measurement framework to a richer action space without losing the theoretical foundations.

I want to be honest about what I do not know. I do not know whether performance on constraint satisfaction puzzles predicts performance on real-world agentic tasks. I believe it does, because the underlying cognitive demands (systematic reasoning under uncertainty, recognition of impossibility, strategic tool selection) are the same demands that appear in debugging, architecture design, and autonomous problem-solving. But I have not validated this belief empirically, and it might be wrong.

I also do not know whether the signal detection framework will continue to be informative at higher capability levels. It is possible that future models will achieve near-perfect discrimination and the d' metric will saturate, just as accuracy on MMLU saturated. If that happens, the benchmark will need new difficulty levels, new constraint types, or new task structures to remain discriminating. The adaptive difficulty system helps with this, but it is not a permanent solution.

What I do know is that I have a benchmark that measures something different from what other benchmarks measure, that it produces interpretable and theoretically grounded scores, that it differentiates models in ways that accuracy alone cannot, and that it extends naturally to agentic contexts where the question is not just "can you reason" but "can you figure out how to reason."

The benchmark will be publicly available. The generation engine, the scoring pipeline, the full transcripts from all three model evaluations, and the BBXA-1 specification are all open. If constraint satisfaction is not the right task, I want to know. If signal detection theory is not the right framework, I want to know. The whole point of publishing this is that I have been wrong before and I would rather find out sooner than later.

I think there is something here. I am more confident about that than I was the last time I wrote those words, and the time before that. Each version has been better than the last. This one has data behind it, which is new. But data can be misleading, and confidence can be premature, and the only way to find out whether the experiment works is to run more of it.

Which, again, is the point.