In the last post I introduced Agentic Experiments, a project about letting coding agents try to improve themselves and seeing what happens. That post was all framing. No code, no results, just the research structure and some speculation about what I thought would happen. This post is about the first experiment that actually produced something, and the something in question is a programming language toolchain built from scratch on AArch64 macOS, starting from raw assembly instructions and bootstrapping upward until we had an optimizing compiler with a benchmark suite.
The experiment is called nothing, because that's what we started with.
How the agent was prompted
Before getting into what was built, it's worth explaining how it was built, because the prompting methodology is part of the experiment.
The agent (Claude, via Claude Code) received a single document at the start: Project.md. This file lays out the philosophy, the bootstrap chain, the rules, and the agentic optimization setup. It's about 3,000 words. It specifies the target architecture (AArch64 macOS, Apple Silicon), the bootstrap sequence (assembler in assembly, IR compiler in assembly, optimization passes on top), the IR design (SSA form, typed, basic blocks, phi nodes), the rules (no LLVM, no GCC backend, no borrowed codegen, correct first and fast later), and the constraints for the agentic optimization loop (what the agent can modify, what it can't, what the metric is).
After that initial prompt, every subsequent instruction was some variation of "continue" or "keep working." That's it. No detailed follow-up prompts. No step-by-step guidance. The agent read the spec, internalized the bootstrap chain, and built the toolchain top to bottom with minimal human steering.
This matters for the experiment because the question isn't just "can Claude write a compiler." The question is whether a single well-structured document is enough to get an agent to execute a multi-stage systems programming project across thousands of lines of handwritten assembly, and then turn around and optimize the output of its own compiler using benchmark-driven iteration. The answer, at least for this first experiment, is that it mostly can.
What was built
The toolchain has three completed stages and a fourth in progress.
Stage 0 is an assembler written in AArch64 assembly, about 5,000 lines across nine modules:

- a lexer that tokenizes ARM64 assembly source text
- a two-pass parser that collects labels on the first pass and encodes instructions on the second
- an instruction encoder where each function is a pure leaf (no calls, no stack frame, just shifts and ORs to pack 32-bit instruction words)
- a Mach-O 64-bit object file emitter that builds the header, load commands, section content, symbol table, and string table in a 1MB buffer and writes it in a single syscall
- a symbol table backed by a 256-bucket hash table with DJB2 hashing and chained entries
- string utilities (comparison, parsing, memcpy, memset, all written by hand)
- lookup tables for 58 mnemonics and 17 condition codes
- error reporting that writes to stderr and exits
The encoder is the part I find most interesting from a "how well does the agent understand the architecture" perspective. ARM64 has fixed-width 32-bit instructions with regular encoding fields. Each encoder function takes the instruction's operands as register arguments (sf, op, Rd, Rn, Rm, shift, amount) and returns the encoded 32-bit word in w0. The implementation is exactly what you'd write by hand: a base constant with the fixed bits, then a sequence of orr instructions to pack each field into its bit position. _enc_add_imm starts with 0x11000000 (the fixed bits for ADD immediate: sf|op|S|100010|sh|imm12|Rn|Rd), then ORs in Rd at [4:0], Rn at [9:5], imm12 at [21:10], S at [29], op at [30], sf at [31]. Every encoder function is a leaf, no frame setup, no calls, just bit manipulation. The agent wrote 30 of these covering arithmetic, logic, shifts, branches, loads, stores, pairs, PC-relative addressing, system instructions, conditional selects, and bitfield moves.
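The packing is mechanical enough to sketch in a few lines of Python. This mirrors the ORR sequence described above (the real encoders are assembly leaf functions; the function name here is invented for illustration):

```python
def enc_add_imm(sf, rd, rn, imm12, sh=0, op=0, s=0):
    """Encode an ADD-immediate instruction word.
    Field layout per the A64 spec: sf|op|S|100010|sh|imm12|Rn|Rd."""
    word = 0x11000000              # fixed bits for ADD immediate
    word |= (rd & 0x1F)            # Rd at [4:0]
    word |= (rn & 0x1F) << 5       # Rn at [9:5]
    word |= (imm12 & 0xFFF) << 10  # imm12 at [21:10]
    word |= (sh & 1) << 22         # shift flag at [22]
    word |= (s & 1) << 29          # set-flags at [29]
    word |= (op & 1) << 30         # op (add/sub) at [30]
    word |= (sf & 1) << 31         # 64-bit flag at [31]
    return word

# add x1, x2, #10
print(hex(enc_add_imm(sf=1, rd=1, rn=2, imm12=10)))  # 0x91002841
```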
The assembler bootstraps once with the system as, and after that it's self-sufficient. 6 out of 7 tests pass; the hello world test is pending @PAGE/@PAGEOFF relocation syntax support, which is a macOS-specific addressing mode for data references.
Stage 1 is an IR compiler, also written in AArch64 assembly, about 4,000 lines across four modules. It takes SSA-form intermediate representation text files and produces AArch64 assembly. The IR has three types (i64, i8, ptr), 30 opcodes (arithmetic, bitwise, comparison, control flow, memory, SSA phi, type conversion), basic block structure, and function definitions with parameters and return types. The text format looks like this:
```
func @main() -> i64 {
entry:
  %x = add i64 10, 20
  %cond = cmp_lt i64 %x, 100
  br_cond %cond, @yes, @no
yes:
  ret i64 %x
no:
  ret i64 0
}
```
The codegen strategy at Stage 1 is intentionally naive. Every virtual register maps to a stack slot at a frame-pointer-relative offset. Every operation loads its operands from the stack into scratch registers, computes, and stores the result back. Phi nodes are handled by emitting register-to-register copies at the end of predecessor blocks, routed through the stack. This produces correct but terrible code, which is the point: the naive baseline is what we measure against.
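As a rough illustration of what that strategy emits for a single add (register choices and slot offsets here are invented, not taken from the Stage 1 source), each operation becomes a load/load/compute/store quartet:

```python
def emit_add(dst_slot, a_slot, b_slot):
    """Sketch of the naive Stage 1 strategy: every virtual register
    lives in a frame-pointer-relative stack slot, so each op loads its
    operands into scratch registers, computes, and stores back."""
    return "\n".join([
        f"    ldr x9, [x29, #-{a_slot}]",    # load first operand
        f"    ldr x10, [x29, #-{b_slot}]",   # load second operand
        "    add x9, x9, x10",               # compute
        f"    str x9, [x29, #-{dst_slot}]",  # store result to its slot
    ])

print(emit_add(24, 8, 16))
```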
Stage 2 is where the optimization work happens. This is a Python framework (about 1,200 lines in the main optimizing compiler, plus several hundred more across individual pass files) that reimplements the IR-to-assembly pipeline with register allocation, strength reduction, phi coalescing, compare-branch fusion, constant hoisting, and branch layout optimization. It also includes the benchmark suite (8 IR programs), the evaluation harness (compile, run three times, take median), and the agent configuration scripts that define what's modifiable and what isn't.
Stage 3 is currently in progress. It's a language frontend (lexer, parser, type checker, IR emitter) written in Python that compiles a C-like surface syntax down to our IR format. The lexer handles newline-sensitive statement termination with nesting-aware suppression (newlines inside parentheses and brackets are ignored). The parser is a standard recursive descent with precedence climbing for expressions, supporting let bindings, assignment, if/else, while loops, break/continue, function definitions, extern declarations, and type casts. The type checker does two passes (register signatures, then check bodies) and handles implicit bool-to-int conversion. The IR emitter does proper SSA construction with phi nodes for if/else join points and while loop headers, including pre-scanning loop bodies for assigned variables to create phi placeholders that get patched after the body is emitted. It handles short-circuit evaluation for && and || by splitting into separate basic blocks with phi nodes at the join.
The frontend already compiles and runs working programs. There's a test suite of 10 .lang files covering arithmetic, branching, loops, nested loops, function calls, if/else with variable updates, and a full Collatz sequence implementation. There are also benchmark versions of fib, sum, power, nested loops, and Collatz that match the Stage 2 IR benchmarks. But this stage is still being iterated on. It's functional but not complete.
What we were actually testing
The programming language itself isn't really the point. The point is what the experiment reveals about how Claude, operating through Claude Code, approaches a specific class of problem: performance optimization on a codebase it built, with a clear metric and a benchmark suite to measure against.
The question I was interested in is: how does an AI agent decide what to optimize? How does it decide when a particular optimization pass is done? How does it react when an optimization it implemented makes things worse instead of better? And can the code it generates, after iterative optimization, compete with what you'd get from a mature compiler targeting the same hardware?
We set up eight benchmarks covering different workload profiles. Iterative Fibonacci (50 million iterations, 3 phi variables), running sum (100 million, 2 phi variables), repeated multiply-by-3 (100 million), factorial (100 million), Euclidean GCD (10 million function calls with mod and phi), Collatz sequence (200,000 function calls with div, mod, and branches), nested loops (8000 by 8000, 64 million iterations), and XOR+AND accumulation (100 million). Each was scaled to produce 100 to 250 millisecond baseline times on Apple Silicon so the measurements would be meaningful. Every benchmark ends with a mod 256 and returns the result as the exit code, so correctness is verifiable.
The baseline is Stage 1's naive compiler. The optimized version is Stage 2 after Claude has worked through its optimization passes.
What Claude did
The optimization work happened in seven passes, and the order and priority decisions are part of what makes this interesting.
The first and largest pass was linear scan register allocation. The implementation computes live intervals via iterative dataflow analysis: it builds gen/kill sets per block, propagates live-in/live-out sets to a fixed point, then derives the start and end sequence numbers for each virtual register from the combined liveness information. It sorts intervals by start point, then greedily assigns physical registers from two pools: callee-saved (x19-x28, ten registers) for values live across function calls, and caller-saved scratch (x9-x15, seven registers) for short-lived values. When both pools run out, it spills to the stack frame with explicit load/store pairs at spill points.
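The core allocation loop is the classic greedy scan over sorted intervals. A simplified Python sketch, assuming a single register pool and omitting the spill-code emission (the real pass splits callee-saved and caller-saved pools):

```python
def linear_scan(intervals, num_regs):
    """Greedy linear scan register allocation (sketch).
    intervals: list of (vreg, start, end) live intervals.
    Returns (assignment dict, list of spilled vregs)."""
    intervals = sorted(intervals, key=lambda iv: iv[1])  # by start point
    free = list(range(num_regs))
    active = []  # (end, reg, vreg) for intervals currently holding a register
    assignment, spills = {}, []
    for vreg, start, end in intervals:
        # Expire intervals that ended before this one starts,
        # returning their registers to the free pool.
        still_live = []
        for e, r, v in active:
            if e < start:
                free.append(r)
            else:
                still_live.append((e, r, v))
        active = still_live
        if free:
            reg = free.pop()
            assignment[vreg] = reg
            active.append((end, reg, vreg))
        else:
            spills.append(vreg)  # no register left: spill to a stack slot
    return assignment, spills
```

With two registers, three overlapping-then-disjoint intervals all fit; with one, two of them spill, which is exactly the pressure point where the real pass starts emitting load/store pairs.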
This single pass accounts for about 2.2x of the total speedup. The Fibonacci inner loop went from 22 instructions to 8. In the baseline, over 80% of instructions were loads and stores shuffling values between the stack and scratch registers. After register allocation, the inner loop runs entirely in registers. The fact that Claude went for this first tells you something about how it prioritizes: it looked at the generated assembly, saw that stack traffic was the dominant cost, and attacked it directly.
The second pass was phi coalescing. For each phi node, it detects self-update patterns where the incoming value on the back-edge is computed from the phi result (like %i_next = add %i, 1 feeding back into %i = phi [..., %i_next, @loop]). Before coalescing, it checks that the phi result isn't used as a source by another phi on the same back-edge, which would create a classic parallel-assignment bug where overwriting one register destroys a value another phi still needs. When it's safe, it merges the live intervals so both get the same physical register. The copy disappears entirely. This turned the power benchmark's inner loop from 7 instructions (including two mov instructions for phi copies) to 4 instructions with zero copies. The add x10, x10, x10, lsl #1 (multiply by 3 via shifted add) and add x9, x9, #1 (increment counter) both operate in-place on their phi registers.
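The safety check is the interesting part, so here is a sketch of it in Python (data shapes are invented for illustration; the real pass works over the parsed IR):

```python
def coalescable_phis(phis, defs, backedge):
    """Find self-update phis that are safe to coalesce (sketch).
    phis: list of (result, incoming) where incoming maps predecessor
    block -> incoming value. defs maps a value to its operands."""
    safe = []
    for result, incoming in phis:
        update = incoming[backedge]
        # Self-update pattern: the back-edge value is computed
        # from the phi result itself (e.g. %i_next = add %i, 1).
        if result not in defs.get(update, ()):
            continue
        # Parallel-assignment hazard: another phi reads this result on
        # the same back-edge, so updating it in place would clobber
        # a value that phi still needs.
        hazard = any(inc[backedge] == result
                     for res, inc in phis if res != result)
        if not hazard:
            safe.append((result, update))
    return safe
```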
The third pass was strength reduction for multiplication. It replaces mul by known small constants with shifted-add sequences. Power of 2 becomes a left shift (1 cycle instead of 3-4 for mul). A value of the form 2^k + 1 becomes add x, x, x, lsl #k, also 1 cycle. The power benchmark's multiply-by-3 became add x10, x10, x10, lsl #1.
The fourth pass was strength reduction for division and modulo, and this was the second biggest win after register allocation. It replaces div by a power of 2 with asr (arithmetic shift right, 1 cycle versus about 10 for sdiv) and mod by a power of 2 with and against a bitmask (1 cycle versus 10+ for the sdiv/msub pair the naive codegen produces). The Collatz benchmark has mod 2 and div 2 in its hot inner loop, so this pass alone pushed it from a 2x speedup to 4x. Every benchmark also has a mod 256 in its epilogue, which went from a multi-cycle division sequence to a single and x, x, #255.
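The decision logic for both strength reduction passes fits in one small function. A sketch (the mnemonic strings are descriptive labels, not the pass's actual output format; note that `asr` only matches `sdiv` rounding for non-negative values, which holds for these benchmarks):

```python
def reduce_strength(op, const):
    """Map mul/div/mod by a friendly constant to a cheap op (sketch).
    Returns (mnemonic, operand) or None when no reduction applies."""
    if const > 0 and (const & (const - 1)) == 0:  # power of two
        k = const.bit_length() - 1
        if op == "mul":
            return ("lsl", k)          # x << k
        if op == "div":
            return ("asr", k)          # x >> k (non-negative x only)
        if op == "mod":
            return ("and", const - 1)  # x & (2^k - 1)
    # 2^k + 1 multiplier: add x, x, x, lsl #k
    if op == "mul" and const > 2 and ((const - 1) & (const - 2)) == 0:
        return ("add_lsl", (const - 1).bit_length() - 1)
    return None

print(reduce_strength("mul", 3))    # ('add_lsl', 1)
print(reduce_strength("mod", 256))  # ('and', 255)
```

This is why the `mod 256` epilogue collapses to a single `and x, x, #255`, and why Collatz's `mod 2`/`div 2` hot loop gained so much.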
The fifth pass fused compare-and-branch sequences. The naive codegen emits cmp, cset, cbnz (three instructions) for every conditional branch. When the comparison result is only used by the immediately following br_cond, Claude's pass emits cmp/b.cc (two instructions, and often one fewer cycle because the branch can issue in the same cycle as the comparison on Apple Silicon). It also uses cbz/cbnz for comparisons against zero, which folds the comparison into the branch instruction itself.
The sixth pass was constant hoisting, where immediate values that require multiple instructions to materialize (anything used in mul, div, mod, or large constants for cmp) get pre-loaded into callee-saved registers in the function prologue. The agent checks which callee-saved registers aren't already used by the allocator and steals up to four of them (x25-x28) for the constant pool. On Apple Silicon this adds about 0.1x because out-of-order execution hides the constant materialization latency anyway, but on an in-order core it would matter more.
The seventh pass improved branch layout by analyzing which side of a conditional branch has phi copies. If only one side needs copies, it inverts the condition to branch over the copy-free path and fall through to the copies, eliminating one label and one unconditional branch per loop iteration.
The optimization that made things worse
Claude implemented an assembly-level peephole optimizer that eliminated redundant loads, folded immediates into instructions, and reduced instruction count by 10 to 16% across the benchmarks. It made everything slower. 0.68x, a 32% regression.
Apple Silicon has store-to-load forwarding. A load that immediately follows a store to the same address completes in 0 to 1 cycles. The "redundant" loads the peephole pass was removing weren't actually costing anything. But removing them changed instruction alignment relative to the CPU's fetch blocks, which degraded performance. On a modern out-of-order processor, instruction count is a poor proxy for execution time.
What matters is what Claude did next. It measured the result, saw the regression, and reverted the pass. It didn't try to fix the peephole optimizer or rationalize the regression. The number was worse, so the change was bad.
The results
The geometric mean speedup across all eight benchmarks was 3.36x. Individual results:
| Benchmark | Baseline | Optimized | Speedup | Status |
|---|---|---|---|---|
| fib | 109ms | 19ms | 5.74x | Near hardware limit |
| bitops | 233ms | 41ms | 5.68x | Near hardware limit |
| nested_loop | 138ms | 24ms | 5.75x | Near hardware limit |
| sum | 158ms | 34ms | 4.65x | Near hardware limit |
| collatz | 128ms | 35ms | 3.66x | Function call overhead |
| gcd | 98ms | 32ms | 3.06x | Function call overhead |
| power | 144ms | 64ms | 2.25x | Shift-add latency |
| factorial | 143ms | 94ms | 1.52x | mul instruction latency |
The more interesting number is how close some of these are to what the hardware can physically do. Factorial runs in 94ms. The mul instruction on Apple M-series has 3 to 4 cycle latency. At 100 million iterations and 3.2GHz, the theoretical minimum is 3 × 100M ÷ 3.2G = 94ms. The generated code is at the hardware limit. There is no faster version of this program on this chip. C and Rust would produce the same 94ms.
Fibonacci is the weird one. Three instructions on the critical path per iteration gives a naive estimate of 47ms, but we measure 19ms. The CPU is exploiting instruction-level parallelism to execute the phi copies (which are mov instructions after coalescing couldn't eliminate them because of the three-variable phi pattern) in parallel with the critical path computation. The optimizer is generating code that's structured well enough for the hardware to find and exploit the available parallelism.
The benchmarks that still have headroom are the ones with function call overhead (GCD and Collatz). The optimizer doesn't do inlining yet.
What this says about agentic optimization
A few things stood out about how Claude approaches performance work when given a clear metric and the freedom to iterate.
It prioritizes correctly. Register allocation first because it eliminates the dominant cost. Strength reduction for division second because it targets the highest-latency instructions. The smaller passes come after the big wins are locked in. An agent could easily get distracted by micro-optimizations while ignoring the fact that 80% of instructions are unnecessary stack traffic. Claude didn't do that.
It respects the benchmark. When the peephole pass made things slower, it reverted. It didn't argue. It didn't try to explain why the regression was actually fine. The number was worse, so the change was bad.
It knows when to stop. The constant hoisting pass added 0.1x. Claude implemented it, measured it, noted the marginal improvement, and moved on rather than spending more iterations trying to squeeze out another 0.05x. The structured benchmark comparison makes it clear which benchmarks have headroom and which don't, and the agent redirects effort accordingly.
And the bootstrapping chain works as a development methodology. The agent operated at the Stage 2 level (Python optimization passes) without needing to understand or modify the 9,000 lines of assembly underneath. Each layer only needs to be correct, not optimal. The assembler produces valid Mach-O files. The IR compiler produces valid assembly. The optimizer makes the assembly faster. You can improve any layer independently. That separation made the optimization work tractable for an agent working from a single spec document and a series of "continue" prompts.
What's next
Stage 3 is still being built out. It needs more type system coverage (arrays, structs), standard library stubs for I/O via libSystem, and there are probably edge cases in the SSA construction for nested control flow that haven't been hit yet. The optimization pipeline needs function inlining (which should close most of the gap on GCD and Collatz), loop unrolling, instruction scheduling, and eventually SIMD vectorization. The assembler needs @PAGE/@PAGEOFF relocation support. And the long-term self-hosting goal is to rewrite the assembler in the language it assembles.
But for the purposes of the Agentic Experiments program, nothing answered its question. Claude can take a single spec document, build a multi-stage systems programming project from it with minimal prompting, then iteratively optimize its own compiler's output using benchmark-driven experimentation, arriving at results that are competitive with hand-optimized code on several benchmarks and at the theoretical hardware limit on others. It can do this without being told what to optimize. It looks at the generated code, identifies what's expensive, and fixes it.
I don't know if that generalizes to harder problems. Optimizing a compiler you built from scratch is a clean task with clear metrics and fast feedback. Real software engineering problems are messier. But as a first data point for what an agent can do when you give it a good spec and get out of the way, the results are better than I expected.
More experiments to come.
View the git repository here: https://github.com/kcodes0/nothing