Can LLMs Be Computers?
Language models can solve research-grade math problems, yet they struggle on simple computational tasks that require reasoning over many steps and long contexts. Even multiplying two numbers or solving a small Sudoku is nearly impossible for them unless they rely on external tools.
But what does it take for an LLM itself to be as reliable and efficient as a computer?
We answer this by literally building a computer inside a transformer. We turn arbitrary C code into tokens that the model itself can execute reliably for millions of steps in seconds.
Here is how it works on a many-step optimization problem: min-cost perfect matching via the Hungarian algorithm.
Solve the min-cost perfect matching for this 10×10 cost matrix:
61 58 35 86 32 39 41 27 21 42
59 77 97 99 78 21 89 72 35 63
88 85 37 57 59 97 37 29 69 94
32 82 53 20 77 96 21 70 50 61
15 44 81 10 64 36 56 78 20 69
76 35 87 69 16 55 26 37 30 66
86 32 74 94 32 14 24 12 31 70
97 63 20 64 90 21 28 49 89 10
58 52 27 76 61 35 17 91 37 66
42 79 61 26 55 98 70 17 26 86
(Demo: the model streams the execution trace at 31,621 tok/s, generating 384,118 tokens at 6,874 lines/s.)
The model does not call an external tool. Instead, it executes the program directly via its transformer weights, producing an execution trace token by token and streaming results at more than 30k tokens/sec on a CPU.
The key technical idea is a new decoding path for execution traces that turns the model's attention lookups from linear scans into queries that take logarithmic time, enabling millions of correct execution steps inside a single transformer run.
Motivation: LLMs cannot compute
State-of-the-art language models can solve impressively hard mathematics, with systems reaching gold-medal standard on the International Mathematical Olympiad and even tackling open math and science problems.
At the same time, a stubborn gap remains: even state-of-the-art models still stumble on tasks that should be purely computational. They make mistakes even on basic addition, and they fail to solve even simple Sudoku puzzles without outside help. Benchmarks such as Sudoku-Bench make this failure mode especially visible, reporting low unaided solve rates.
In practice, we bridge this gap using two families of workarounds:
- Tool use: the model writes code; an external interpreter executes it; the model reports the result. Tool-integrated approaches measurably improve math reasoning.
- Agentic orchestration: an outer loop stores intermediate state, decomposes tasks, and repeatedly calls the model on short contexts, effectively bolting a state machine onto the outside.
These approaches are extremely useful but they highlight an important limitation: LLMs do not reliably perform long, exact computations on their own, so in practice we often delegate the execution to external tools or orchestration systems.
An analogy makes the distinction clear. Humans cannot fly. Building airplanes does not change that; it only means we built a machine that flies for us.
Today's language models are in the same situation. When a task requires exact computation, we attach external systems (interpreters, code runners, agent loops) and let those systems do the computing.
But that means the capability still lives outside the model. The model itself is fundamentally handicapped: it cannot carry out the computation it is reasoning about, so it must repeatedly hand the task off to another system.
As a result, the model can describe algorithms, reason about them, or orchestrate tools that run them, but it cannot execute the steps itself. A system that cannot compute cannot truly internalize what computation is.
So the real question is not whether a model can talk about computation, or even access it through tools. The real question is whether it can execute computation internally: reliably, efficiently, and over very long horizons. If it could, it would stop being merely a coordinator of computation and become a computer itself.
How we turned LLMs into computers
On the surface, nothing is broken. The transformer architecture that powers LLMs is incredibly capable. Multiple works have shown that different variants of transformers can carry out complex computations because they can simulate Turing machines and hence any effective computer. Recent work shows that they can even achieve these computational capabilities via training.
But theoretical universality is not the same as practical execution. A model can be expressive enough to simulate a computer and still be a terrible computer in practice. Classical universality results often rely on constructions where simple operations in a real machine correspond to long sequences of steps in the simulated model. In other words, representational capability alone does not guarantee practical execution efficiency.
We tackle this by implementing a modern RAM computer inside the transformer rather than a purely theoretical model of computation. In our construction, each instruction maps to only a handful of tokens (at most 5).
But even that is not enough. The deeper problem is the decoding process itself.
Transformers have a structural handicap as executors: standard autoregressive decoding makes each step interact with the full accumulated history. A real computer updates a compact state with roughly constant work per instruction. A transformer instead generates one token at a time while continually interacting with a prefix that keeps growing. KV caching avoids recomputing past key/value projections, but decoding still requires attending over the cached prefix, so work per step remains coupled to sequence length.
We remove this handicap by showing that the same transformer architecture admits an efficient decoding scheme for execution-style traces. The key technical unlock is to restrict lookup heads to head dimension 2, which enables a decoding path where the dominant retrieval/update operations can be computed in log time in the sequence length (for this structured executor regime), rather than by a full prefix-sized attention sweep.
This is powerful enough to let us execute arbitrary programs inside a transformer for millions of steps.
The rest of this post walks through how this works. Feel free to skip ahead:
- What does computation mean? — How in-model execution differs from tool use: the transformer runs the program itself, step by step.
- More demos: Sudoku — Executing a compiled solver inside the transformer to solve the Arto Inkala Sudoku, regarded as the hardest Sudoku in the world.
- How can computation be encoded? — Execution as a trace that only appends, and how a transformer reconstructs state by looking back at prior steps.
- Exponentially Fast Attention — The core unlock: 2D heads turn attention into a convex-hull query, enabling log-time decoding over million-step traces.
- What is next? — Richer attention, training at scale, compiling programs into weights, and growing AI systems like software.
What does computation mean?
These days, when a model needs to compute something, it writes code. For example, to add 3 + 5, a model might output:
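A minimal illustrative snippet of the kind of code a model emits here (the exact code varies by model and prompt):

```python
# Illustrative only: code a model might write for the external interpreter to run.
print(3 + 5)  # the sandbox executes this and sends back "8"
```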
Generation then pauses. An external tool-use mechanism intercepts the code block, sends it to a sandboxed interpreter, and injects the result (8) back into the token stream. The model resumes, now knowing the answer.
This works, but the actual execution happened outside the model. The model specified the computation, then waited for an external system to carry it out.
Our transformer also emits a program, but instead of pausing for an external tool, it executes that program itself, step by step, within the same transformer.
To achieve this, we implemented a WebAssembly interpreter inside the transformer weights. WebAssembly is a low-level instruction set designed for fast, deterministic execution and a universal target that languages such as C and C++ can compile to. To compute 3 + 5, the model would write:
```
{
  i32.const 03 00 00 00
  i32.const 05 00 00 00
  i32.add   00 00 00 00
  output    00 00 00 00
}
```
The model then switches to fast decoding mode and executes the program itself, step by step within the same transformer, producing an execution trace of tokens:
```
03 00 00 00   commit(+1,sts=1,bt=0)
05 00 00 00   commit(+1,sts=1,bt=0)
08 00 00 00   commit(-1,sts=1,bt=0)
out(08)
halt
```
Each line contains tokens the model generates. The stack grows, the add fires, the result is output, and the machine halts, all within the model's own output stream, with no external round-trip.
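The semantics of this trace can be sketched with a toy stack machine in plain Python (an illustration of the wasm-style ops above, not the in-model executor; the commit annotations track stack growth and are omitted here):

```python
def run(program):
    """Toy stack machine mirroring the wasm-style trace above (illustration only)."""
    stack, outputs = [], []
    for op, arg in program:
        if op == "i32.const":
            stack.append(arg)                  # push a constant: stack grows (+1)
        elif op == "i32.add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) % 2**32)      # pop two, push the sum (net -1)
        elif op == "output":
            outputs.append(stack[-1])          # emit the top of the stack
        elif op == "halt":
            break
    return outputs

run([("i32.const", 3), ("i32.const", 5), ("i32.add", None),
     ("output", None), ("halt", None)])        # -> [8]
```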
(Figure: tool use routes code through an external round-trip, while in-model execution keeps the wasm program and its execution trace inside the token stream.)
The key difference is that tool use is opaque: the model hands off control and receives a black-box answer. In-model execution is transparent: every intermediate step appears in the trace, and the model never leaves its own decoding loop.
More demos: Sudoku
Sudoku is another useful stress test for exact long-horizon computation. Learned neural approaches can achieve strong performance on easy or random Sudokus but completely stumble on the hard ones. A common explanation is that autoregressive models are fundamentally ill-suited for constraint-satisfaction problems because they commit to answers token-by-token and cannot revise early mistakes. But our fully autoregressive system achieves 100% accuracy on these benchmarks, suggesting that the real bottleneck is not the autoregressive paradigm itself — it is that solving hard Sudokus requires very long execution traces, and standard attention makes long-context generation prohibitively expensive. This is exactly the problem our fast attention path addresses.
Our system executes a fully correct compiled Sudoku solver inside the transformer itself. There is no learned heuristic standing in for the algorithm and no gap between "the model suggested a solution" and "an external system verified it". The transformer executes the solver step by step.
The guarantee is universal rather than benchmark-specific: if the compiled solver is correct, the transformer's execution is correct as well. In practice, this lets the model solve even famously hard instances such as Arto Inkala's Sudoku, reaching the correct solution in under 3 minutes.
Solve this Sudoku puzzle:
(Demo: 34,293 tok/s, 4,973,756 tokens, 7,554 lines/s.)
How can computation be encoded?
A useful way to picture an autoregressive transformer is as a machine that lives inside its own history. A traditional computer has editable memory that it updates as a result of operations. But in a transformer there is no such thing.
There is a fixed prompt (the input or the program). Then there is a trace that only grows (the tokens the model generates). At each step, the model can only "look back" through a small number of queries (attention heads), and must then append one more token. Within these constraints, we need to encode a working machine.
To build intuition about this model and how the encoding works, it is helpful to think about the following.
Computation as a trace that only appends
To understand how a transformer can execute a program internally, it helps to think of computation in a slightly unusual way.
Imagine a notebook where every step of a computation is written on the next line. Once written, earlier lines cannot be changed; the notebook only grows.
- The first few lines contain the input (the prompt).
- Each new line records the next step of the computation.
- At every step you are allowed to look back at earlier lines, but you cannot edit them.
This is surprisingly close to how an autoregressive transformer operates: the prompt is the input, the generated tokens form a trace that keeps growing, and each new token is produced after looking back at a small number of positions via attention.
Here is a toy example. Given a sentence, count whether the number of verbs is odd or even. Each trace token attends to exactly two positions: the corresponding input word (to check if it is a verb) and the previous trace token (to read the running parity).
Notice that each step only needs two lookbacks (one into the prompt and one into the trace), regardless of how long the sentence is. This is the key insight: many algorithms can be expressed as a trace that only appends, where each step reads a small, fixed number of earlier positions.
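This verb-parity trace is easy to write down in code; a toy Python sketch (the verb set is supplied explicitly here, rather than recognized by the model):

```python
def parity_trace(words, verbs):
    """Append-only trace: step i reads word i (is it a verb?) and trace[i-1] (running parity)."""
    trace = []
    for i, w in enumerate(words):
        prev = trace[i - 1] if i > 0 else "even"   # lookback 1: previous trace token
        flip = w in verbs                          # lookback 2: the i-th prompt word
        trace.append("odd" if (prev == "odd") != flip else "even")
    return trace

parity_trace("the cat sat and purred".split(), verbs={"sat", "purred"})
# -> ['even', 'even', 'odd', 'odd', 'even']
```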
Can a computation be represented as a trace that only appends, where each step only needs to look back a small number of times?
The answer is yes. In our system the model generates such a trace explicitly. The tokens it produces represent the evolving state of a virtual machine: instruction pointer, memory and stack operations, arithmetic, control flow, and outputs. The model reconstructs the current state simply by looking back to the relevant earlier steps.
Technical Note: What can a transformer look back at?
The rest of this post focuses on the efficiency challenge: even if a transformer can represent such an execution trace, standard decoding still pays a growing cost as the trace becomes longer. Our fast decoding path removes that handicap, and the 2D head restriction is the key unlock that makes the fast path possible.
The key unlock: Exponentially Fast Attention
A real computer runs long programs by updating a compact state (registers, stack, memory) with near-constant work per instruction.
A standard transformer "advances state" by generating tokens, and each next token is computed by attention over a prefix that grows forever. KV caching reuses previously computed keys and values, but it does not remove the basic scaling: at each step the query still has to interact with a cache whose size grows with the number of generated tokens. This is why decoding research often turns into an IO problem: the bottleneck is making the KV cache fast enough to score against.
But no matter how fast one makes KV caching, the fundamental scaling remains: the $t$-th decoding step still interacts with a prefix of length $t$. Work per step therefore grows linearly with the trace length, and the total cost of generating $t$ tokens grows quadratically. This is a well-known bottleneck of transformers, and designing faster alternatives is an active area of research.
As we'll explain below, our approach addresses the quadratic blow-up and yields exponentially faster attention lookups. Instead of spending $\Theta(t)$ time per step, our method requires $O(\log t)$ time. Here is how it looks in practice, when using our HullKVCache compared to a standard KVCache:
(Demo: with HullKVCache, 41,709 tokens at 31,037 tok/s (6,747 lines/s), reaching line 9,067 of 9,580 and finishing in 1.3 s; with a standard KVCache, 35,614 tokens at 185 tok/s (40 lines/s), reaching line 7,742 of 9,580 after 258.9 s.)
In the workloads we care about (long execution traces where the model repeatedly performs a small number of structured lookups per step), the difference in scaling per step compounds.
A linear scan decoder pays a cost that keeps growing as the trace grows. A hull-based decoder keeps the cost per step essentially tied to a retrieval primitive that runs in logarithmic time. Over long horizons, that changes what is feasible: computations that would be impractically slow under standard decoding become runnable for millions of steps.
This is also why the payoff shows up most clearly on "boring" deterministic spans: exact copying, state-machine stepping, or long mechanical traces. These are precisely the steps where we do not want to spend full attention budget.
The fast path: 2D attention
We obtain this speed-up by reframing the question. We are not attempting to speed up transformers in full generality, nor do we want to introduce yet another architecture.
Instead, we focus on vanilla transformers but under a tractable parameterization. We specifically target the case where the dimension of the heads is small: 2D.
This does not mean that the whole transformer is small. You can still have an arbitrarily large number of layers, arbitrarily many heads, and arbitrarily large embeddings. It just means that the embeddings used across the transformer's layers are broken into smaller chunks, resulting in more heads: $n_\text{heads} = d_\text{model} / 2$.
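Concretely, the parameterization just carves the model width into many two-dimensional heads; a minimal PyTorch sketch (the shapes are illustrative):

```python
import torch

d_model = 36
n_heads = d_model // 2               # with 2D heads, n_heads = d_model / 2
x = torch.randn(1, 10, d_model)      # (batch, seq, d_model) activations

# The embedding is split into n_heads chunks of size 2, one per head:
heads = x.view(1, 10, n_heads, 2)
print(heads.shape)                   # torch.Size([1, 10, 18, 2])
```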
Of course, this restriction can come at a cost for certain tasks and is not the magic answer to everything. While we are still exploring how limiting this is in practice for training, say, large models, we find that it is still flexible enough to train efficiently and can capture very complicated logic.
In fact, for Turing completeness, 2D attention is all you need! And it is still flexible enough to represent a whole RAM computer as we show in this blog. It is a crucial enabler of building a fast-path in transformers. While abstract reasoning and planning remain part of the original path, heavy computational tasks can go through the fast path.
The model itself is a completely standard PyTorch transformer, nothing exotic:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaTransformer(nn.Module):
    def __init__(self, vocab, d_model=36, n_heads=18, n_layers=7, d_ffn=36):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True, bias=False)
            for _ in range(n_layers)
        ])
        self.ff_in = nn.ModuleList([nn.Linear(d_model, 2 * d_ffn, bias=False) for _ in range(n_layers)])
        self.ff_out = nn.ModuleList([nn.Linear(d_ffn, d_model, bias=False) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, idx):
        T = idx.shape[1]
        x = self.tok(idx) + pos_emb(T)  # pos_emb: positional-embedding helper
        causal = torch.triu(torch.ones(T, T, device=idx.device, dtype=torch.bool), diagonal=1)
        for attn, ff_in, ff_out in zip(self.attn, self.ff_in, self.ff_out):
            y, _ = attn(x, x, x, attn_mask=causal, need_weights=False)
            x = x + y
            gate, val = ff_in(x).chunk(2, dim=-1)
            x = x + ff_out(F.relu(gate) * val)
        return self.head(x)
```
Notice: d_model = 36 with n_heads = 18 gives exactly 2D per head. The architecture uses 7 layers, standard nn.MultiheadAttention, and a gated feed-forward network. No custom attention kernels, no sparse masks; just vanilla PyTorch. The only thing that makes it special is the weights.
Let's look at how 2D attention gets us these impressive speed-ups.
The geometric view of 2D attention
The attention mechanism works as follows:
- each past token contributes a key vector $k_j$ and a value $v_j$,
- the current step forms a query vector $q$,
- relevance scores $q \cdot k_j$ are computed for every key, and
- the head returns the sum of all values, reweighted by the softmax of their scores.
In the special case that matters for our fast path:
- each key is 2D, $k_j \in \mathbb{R}^2$,
- the query $q \in \mathbb{R}^2$ can be thought of as a direction in the plane,
- and we want the value of the point that maximizes the dot product in that direction (hard-max attention). In case of ties, we average.
Despite these maximization queries being global over the whole history, there exist efficient data structures that answer such 2D attention queries in time logarithmic in the number of points. This is exactly the classic "supporting point" query in computational geometry: given a direction $q$, find the point on the convex hull farthest in that direction. We speed up decoding by maintaining such a data structure as tokens are generated, which lets us search efficiently over the whole set of past points.
Restricting the head dimension to 2 is what enables the fast path: during inference we can replace a brute force scan ("score against every key") with a structure where the maximum can be found by looking only at a tiny subset of points in the hull.
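The supporting-point query can be made concrete with a small self-contained sketch (standard computational-geometry code, not our production decoder; here the query scans the hull's vertices, and the logarithmic bound comes from binary-searching the hull, which we omit for brevity):

```python
def cross(o, a, b):
    """Cross product of (a - o) and (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull(points):
    """Convex hull of 2D points via Andrew's monotone chain, in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def chain(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = chain(pts), chain(pts[::-1])
    return lower[:-1] + upper[:-1]

def supporting_point(h, q):
    """Hard-max attention lookup: the hull vertex maximizing q . k.
    The dot product is unimodal along the hull, so binary search gives O(log n);
    a plain scan over the (small) hull is shown here for clarity."""
    return max(h, key=lambda p: p[0] * q[0] + p[1] * q[1])

keys = [(1, 1), (4, 2), (2, 5), (0, 3), (3, 3), (5, 0)]   # 2D keys seen so far
h = hull(keys)                    # (3, 3) is interior: it can never win a query
supporting_point(h, (1.0, 0.2))   # -> (5, 0), same answer as scanning every key
```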
Technical Note: Why do two dimensions suffice?
So what is next?
We showed that a transformer can become a computer, and that opens a new interface between software and neural networks. Once programs can run efficiently inside the model's own inference loop, a much larger design space opens up.
Richer attention mechanisms
For simplicity, our construction uses hard-max attention. But that is not a fundamental restriction. While we do not yet know whether exact softmax attention can be maintained with the same efficiency, it is easy to approximate it with k-sparse softmax attention: retrieve the top-$k$ keys and perform the softmax only over those. By storing points across nested convex hulls, this yields a decoding cost of $O(k + \log n)$.
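As a reference for the semantics (brute-force top-k selection here; the nested-hull structure is what makes the retrieval logarithmic in practice):

```python
import math

def k_sparse_attention(q, keys, values, k=2):
    """Softmax restricted to the top-k scoring keys; all other keys get weight zero."""
    scores = [q[0] * kx + q[1] * ky for kx, ky in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)                 # stabilize the softmax
    w = [math.exp(scores[i] - m) for i in top]
    z = sum(w)
    return sum(wi / z * values[i] for wi, i in zip(w, top))

# Only the two best-aligned keys contribute; the key (0, 1) is ignored entirely.
k_sparse_attention((1.0, 0.0), [(2.0, 0.0), (1.0, 0.0), (0.0, 1.0)], [10.0, 20.0, 30.0], k=2)
```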
This means the geometric fast path is not limited to our executor construction. In principle, it can accelerate any transformer with 2D heads at decoding time, replacing full linear scans with efficient geometric retrieval. The same machinery also extends naturally to 3D heads via 3D convex hulls, although higher dimensions quickly become less efficient. The key question is whether 2D already captures most of the speedup, or whether slightly larger heads unlock substantially more capability.
(Figure: a query direction $q$ against nested convex hulls, with the outer hull, inner hull, and top-k points highlighted.)
k-sparse softmax retrieves the top‑k keys from nested convex hulls and applies softmax only over those — cost O(k + log n).
The same geometric machinery extends to 3D heads via 3D convex hulls.
The 2D-head parameterization used here is not inherently small. The total parameter count can remain comparable to standard transformers because the model can simply use more heads and layers. The real question is how capable such models become when trained at scale.
That question matters even beyond pure capability. These models could be useful in several modes: as a dedicated fast path paired with a slower, more general model; as part of a fast/slow hybrid architecture inside a single system; or as a speculative execution model that proposes tokens quickly while a regular-attention model verifies and accepts them. Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.
(Figure: three deployment modes at the same parameter budget (d_model = 256): a dedicated executor for heavy computation; a fast/slow hybrid in which 2D heads handle execution while full-dimension heads reason; and speculative decoding in which the 2D model proposes tokens and a larger model verifies them.)
The system presented here is a standalone transformer designed as an executor. A natural next step is to combine it with a large language model and train the model to invoke this execution pathway when exact computation is required. In such a hybrid system the language model would plan and reason, while the execution component would run algorithms.
Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.
Programs into weights & training beyond gradient descent
In our current prototype the model learns an interpreter whose behavior is encoded in its weights. But the compilation machinery we built for generating those weights can go further. In principle, arbitrary programs can be compiled directly into the transformer weights, bypassing the need to represent them as token sequences at all.
That would make weights themselves a deployment target for software. Instead of merely learning software-like behavior, models could literally contain compiled program logic as part of their internal circuitry.
Weights become a deployment target: instead of learning software-like behavior, models contain compiled program logic.
If logic can be compiled into weights, then gradient descent is no longer the only way to modify a model. Weight compilation provides another route for inserting structure, algorithms, and guarantees directly into a network.
Taken seriously, this suggests a different picture of training altogether: not just optimizing weights with data, but also writing parts of the model directly. Push that idea far enough and you get systems that do not merely learn from experience, but also modify or extend their own weights, effectively rewriting parts of their internal machinery.
Growing AI systems like software
Finally, if software becomes part of the neural architecture, then AI systems need a way to grow over time much like software libraries do today. Modern software ecosystems evolve by accumulating modules, abstractions, and reusable components. A similar process may eventually occur inside AI systems, where new computational abilities are added incrementally to the model's internal execution engine.
(Figure: an internal execution engine growing module by module: core, math, sorting, graph, crypto, DSP, and new modules.)
Just as software ecosystems grow by accumulating modules, AI systems can add computational abilities incrementally — each one compiled into the model's internal execution engine.
New capabilities connect to the existing circuit without retraining the whole system.
Our work shows that transformers can execute programs efficiently inside their own inference loop. The broader vision is that future AI systems will not just use software; they will contain it, integrating learned representations with compiled algorithms inside a single computational substrate. In that world, software itself becomes part of the model.
Closing thoughts
We showed that transformers can execute programs efficiently inside their own inference loop, not as an external tool, but as part of the model itself. This opens a path toward AI systems that integrate learned representations with compiled algorithms inside a single computational substrate.
We think this matters because the hardest problems we care about, sequential decision-making under uncertainty in healthcare, supply chains, and financial institutions, will require systems that can both reason flexibly and compute reliably.
We are building these systems now, and we are hiring. If you want to work on problems at the intersection of language models, reinforcement learning, and real-world decision systems, join us.