Toward a Differentiable Neural Computer on a Frozen-Embedding Substrate: PCA of a Constructed Neural Turing Machine, and a RAM-State Machine that Runs on the Substrate
Abstract
A transformer with analytically computed (untrained) weights can execute
arbitrary WebAssembly programs — Percepta's transformer-vm. We read this artifact
as an autoregressive, deterministic Neural Turing Machine (NTM): attention is
used as exact, content/location-addressed memory access, the feed-forward layers are
the per-step compute, and the append-only token sequence is the machine's state. We
ask a concrete engineering question for building a Differentiable Neural Computer
(DNC) on Sutra — a typed functional language whose compiled program is a fused
tensor-op graph over a frozen embedding substrate: can the constructed
transformer's attention be reduced to a smaller, runnable core, and can an NTM-style
machine run on the Sutra substrate at all?
We report two measured results. First, principal-component / singular-value analysis of the constructed weights shows that magnitude-PCA is the wrong reduction lens for this machine: the weights span ~30 orders of magnitude (the hardmax temperature and address arithmetic produce singular values to ~1e30), so energy-fraction rank is dominated by a few giant "switch" directions while the small directions carry the actual byte logic. The honest reduction lever is the computation schedule, not the weight spectrum: of the nominal 19 heads × 7 layers = 133 attention head-slots, only 42 (31.6%) genuinely attend, concentrated in 5 layers (two attention layers are entirely zero), and the 915-symbol vocabulary embeds into a ~3-dimensional subspace. Second, we build a RAM-state stack machine that runs on the Sutra substrate and is Turing-complete (memory, arithmetic, bitwise, comparison, conditional branch, and backward-branch loops), with all machine state held in a RAM device and opcode dispatch performed by reading the opcode fresh from memory each step. The hard substrate questions (memory model, dispatch, state, side effects) are answered with measurements; the remaining work is breadth.
1. The artifact: a constructed, deterministic Neural Turing Machine
Percepta's transformer-vm is a standard softmax-ReGLU transformer whose weights are
constructed analytically from a computation-graph DSL, not trained. A C program
is compiled to WebAssembly; the supported WASM opcodes are encoded as byte-level
arithmetic over a residual stream that acts as machine memory (stack, locals, linear
memory, instruction cursor, call depth); a MILP solver schedules the graph nodes onto
transformer layers; the resulting tensors are written out. Replication results
(measured; WASM/FINDINGS.md): the analytic model reproduces the reference WASM
execution trace token-for-token on all 6/6 test programs, including a Sudoku
solver run of 1,055,417 tokens, at a mean of 18,049 tokens/s over 1,292,732
tokens (the authors report ~30K tok/s; the same order of magnitude). The analytic
model is small: d_model = 38, n_layers = 7, n_heads = 19, d_ffn = 44, vocab = 915;
the MILP solves to optimality in 5.5 s.
Read in the lineage of neural-memory architectures, this is an autoregressive,
deterministic NTM. A Neural Turing Machine (Graves, Wayne & Danihelka 2014) is a
neural controller with external memory whose read/write heads address memory via
attention; a Differentiable Neural Computer (Graves et al. 2016) adds dynamic
allocation and temporal links. Both are trained, differentiable, recurrent.
transformer-vm uses the same core mechanism — attention as content/location-
addressed memory access — but is constructed and exact (the addressing is hardmax,
never approximate), has no recurrent controller (the autoregressive loop plays that
role), and its memory is the append-only token sequence. The full framing is in
WASM/notes/significance_and_isomorphism.md.
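To make "attention as exact, location-addressed memory access" concrete, here is a minimal sketch in plain Python. It is not transformer-vm's construction: the scoring rule, the `(address, value)` log layout, and the function name `hardmax_read` are assumptions made for illustration. Only the temperature `HARD_K = 1e10` is taken from the text. The sketch shows how a softmax with a very large temperature over an append-only log returns exactly the most recent value written to an address.

```python
import math

HARD_K = 1e10  # hardmax temperature quoted in the text; the rest is illustrative

def hardmax_read(memory, query_addr):
    """memory: append-only list of (address, value) writes, oldest first (assumed layout)."""
    n = len(memory)
    # A mismatched address costs at least n, more than any position bonus (< n),
    # so the newest write to query_addr has the strictly largest score.
    scores = [HARD_K * (pos - n * (addr - query_addr) ** 2)
              for pos, (addr, _) in enumerate(memory)]
    top = max(scores)
    # Every losing score trails the winner by at least HARD_K, so exp() underflows to 0.0.
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * value for w, (_, value) in zip(weights, memory))

log = [(0, 11), (1, 22), (0, 33), (2, 44)]   # address 0 is written twice
latest_at_0 = hardmax_read(log, 0)           # picks the later write to address 0
value_at_1 = hardmax_read(log, 1)
```

Because the losing weights underflow to zero, the read is a selection rather than a blend. In this sketch, a query for an address that was never written still returns some entry, so a real construction would need its own convention for that case.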
2. Question and method
To build a DNC on Sutra we need a small, runnable attention core. The natural first attempt is to take the constructed transformer's weights and reduce their dimensionality by PCA/SVD. We (i) built the analytic transformer (the MILP schedule, cached), (ii) ran a full singular-value decomposition of every weight matrix, and (iii) measured how many attention heads genuinely attend per layer. All of this is offline analysis of the constructed weights; none of it sits on a runtime path.
3. PCA result: magnitude is the wrong lens; the schedule is the lever
The analytic model has 144,286 parameters — d_model is already 38, so this is
not an over-provisioned embedding to shrink. SVD of the weight matrices shows an
extreme dynamic range: singular values reach ~1e30 (some matrices to 1e89–1e119),
produced by the hardmax temperature (HARD_K = 1e10) and the 2^k address/position
scales, down to ~1 for the byte logic. Consequently the energy-fraction "effective
rank" is dominated by a few giant directions and reports a misleadingly low rank: the
small-magnitude singular directions, which carry the actual computation, contribute
almost nothing to the Frobenius norm. Magnitude-PCA cannot truncate this machine —
dropping the small directions deletes the logic, not redundancy. (The squared
singular values overflow float32; the analysis runs in float64.)
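The failure mode can be reproduced on a toy spectrum. The sketch below is illustrative only: the singular values are hypothetical (two "switch" directions near 1e30, six "logic" directions near 1), and `energy_rank` is a name introduced here, not a function from the analysis scripts.

```python
def energy_rank(singular_values, fraction=0.99):
    """Smallest k such that the top-k squared singular values hold `fraction` of the total."""
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return k
    return len(energies)

# Hypothetical spectrum: two giant switch directions, six unit-scale logic directions.
spectrum = [1e30, 5e29, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

rank_99 = energy_rank(spectrum, 0.99)
# Share of total energy carried by the six logic directions.
logic_share = sum(s * s for s in spectrum if s <= 1.0) / sum(s * s for s in spectrum)
```

On this spectrum the 99% energy rank is 2 of 8, and the six logic directions together hold a vanishing share of the energy, so truncating at that rank would keep the switches and discard the computation. Python floats are float64, which is why the squared values (up to 1e60) do not overflow here.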
What is reducible, measured honestly:
- Two of seven attention layers are entirely zero (attn.5, attn.6: their input and output projections sum to exactly 0). The schedule places all attention in the first five layers; the last two are FFN-only. These attention blocks are directly removable.
- The 915-symbol vocabulary embedding is genuinely low-rank: the token and head embedding matrices (915×38) carry 99% of their energy in 3 of 38 dimensions (90% in 1–2). No giant switches live there, so this is a magnitude-honest reduction.
- The attention core's reduction lever is the schedule, not the spectrum. Counting heads that genuinely attend (Q and K projection rows non-zero), only 42 of the nominal 133 head-slots (31.6%) are used — per layer 7, 5, 11, 11, 8, 0, 0. The reduced-attention target for a DNC realization is therefore ~⅓ of the nominal provisioning, concentrated in five layers, and must be obtained by scheduling fewer heads/dims in the computation graph rather than by SVD-truncating the constructed weights.
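The head-usage count can be sketched as follows. This shows the counting rule only, not a re-measurement: the toy projections are built from the per-layer counts reported above, and the names (`head_attends`, `toy_head`) and the 2×2 matrix shape are assumptions. In the real analysis each head's Q and K blocks would be sliced from the constructed weight tensors.

```python
N_HEADS = 19
ACTIVE_PER_LAYER = [7, 5, 11, 11, 8, 0, 0]   # per-layer counts reported in the text

def is_nonzero(matrix):
    return any(x != 0.0 for row in matrix for x in row)

def head_attends(w_q, w_k):
    # With an all-zero Q or K projection every score is 0, so the head cannot select anything.
    return is_nonzero(w_q) and is_nonzero(w_k)

def toy_head(active):
    # Stand-in 2x2 projections (hypothetical); inactive heads are all zeros.
    block = [[1.0, 0.0], [0.0, 1.0]] if active else [[0.0, 0.0], [0.0, 0.0]]
    return block, block

model = [[toy_head(h < n_active) for h in range(N_HEADS)]
         for n_active in ACTIVE_PER_LAYER]

used_per_layer = [sum(head_attends(q, k) for q, k in layer) for layer in model]
used = sum(used_per_layer)
slots = N_HEADS * len(model)
percent_used = round(100 * used / slots, 1)
empty_layers = [i for i, n in enumerate(used_per_layer) if n == 0]
```

The totals reproduce the figures in the text: 42 used head-slots out of 133 (31.6%), with layers 5 and 6 empty.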
4. A RAM-state NTM-style machine that runs on the Sutra substrate
Independent of the reduction question, we tested whether an NTM-style machine can run
on the Sutra substrate at all. Sutra compiles to tensor operations over a frozen
embedding space; numbers live on synthetic axes; storage is the substrate RAM device,
read and written by ramRead/ramWrite (planning/sutra-spec/ram-pointers.md).
We hand-wrote a stack machine whose entire state (program counter, stack pointer, halt flag, the program, and the value stack) lives in RAM, and whose host driver issues one execution step per instruction, the same autoregressive shape as the transformer. Each step reads the opcode fresh from RAM and dispatches by comparing it against the opcode tags. This matters because a value read fresh from memory defuzzes cleanly against a literal, whereas a value carried across substrate loop iterations does not, so dispatch is driven from memory rather than from loop-carried state. Per-opcode side effects are realized as single blended writes to fixed cells: each cell receives its new value if its opcode matched, otherwise a no-op rewrite of its existing value. This avoids address blending on the fuzzy substrate.
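A minimal plain-Python sketch of the blended-write idea follows. It is not Sutra code and does not use the substrate's ramRead/ramWrite: the opcode tags, the cell addresses, and the helper `match` are hypothetical, and the comparison is exact here rather than a defuzzed substrate comparison. The point it illustrates is that the write address is always the same fixed cell, and only the written value is blended by the match indicators.

```python
OP_ADD, OP_SUB, OP_NOP = 1, 2, 3                 # hypothetical opcode tags
OP_CELL, A_CELL, B_CELL, OUT_CELL = 0, 1, 2, 3   # hypothetical fixed addresses

def match(opcode, tag):
    # Stand-in for the substrate comparison: 1.0 on a match, 0.0 otherwise.
    return 1.0 if opcode == tag else 0.0

def step(ram):
    opcode = ram[OP_CELL]            # read fresh from memory on every step
    a, b = ram[A_CELL], ram[B_CELL]
    candidates = [(match(opcode, OP_ADD), a + b),
                  (match(opcode, OP_SUB), a - b)]
    fired = sum(m for m, _ in candidates)
    # One write to a fixed address: the matching candidate, or the old value unchanged.
    ram[OUT_CELL] = sum(m * v for m, v in candidates) + (1.0 - fired) * ram[OUT_CELL]

ram = [OP_ADD, 3, 4, 0]
step(ram)
after_add = ram[OUT_CELL]    # ADD matched, so the cell takes 3 + 4

ram[OP_CELL] = OP_NOP
step(ram)
after_nop = ram[OUT_CELL]    # nothing matched, so the cell is rewritten with its old value
```

Every opcode's candidate is computed on every step, and the indicator selects which one lands. The cost is redundant arithmetic; the benefit is that no address is ever a blend of two locations.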
The machine is a genuine interpreter — the program is data in RAM — and is
Turing-complete: it has memory (LOAD/STORE), integer arithmetic
(ADD/SUB/MUL), bitwise AND (via a substrate bit-plane decomposition),
comparison (EQ/LT), a conditional branch (BR_IF), and HALT (11 opcodes).
Measured on the substrate: arithmetic programs (e.g. 3+4 = 7, 100+23 = 123,
chained 5×6−2 = 28), bitwise (12 AND 10 = 8), comparisons (3<5 = 1, 7==7 = 1),
a conditional branch taking or not taking by data, a STORE/LOAD round-trip, and a
backward-branch memory loop (a counter at one address, an accumulator at another;
each iteration increments the accumulator and decrements the counter, branching back
while non-zero) that yields acc = N for N = 1, 3, 5. All cases are guarded by a
regression test that compiles the machine and runs it on the substrate (13/13).
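For orientation, the semantics of such a machine can be written as a plain-Python reference interpreter. This is a sketch under assumptions, not the substrate realization or the project's test code: the `(opcode, argument)` program encoding, the memory addresses, and the `PUSH` opcode are introduced here for illustration (the text does not name an immediate-load opcode); the other mnemonics are the ones listed above.

```python
def run(program, memory_size=4, max_steps=10_000):
    """Reference semantics only: plain Python, no substrate, no blending."""
    mem, stack, pc = [0] * memory_size, [], 0
    binary = {
        "ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b,
        "MUL": lambda a, b: a * b, "AND": lambda a, b: a & b,
        "EQ": lambda a, b: int(a == b), "LT": lambda a, b: int(a < b),
    }
    for _ in range(max_steps):
        op, arg = program[pc]        # the program is data, fetched every step
        pc += 1
        if op == "HALT":
            return stack, mem
        elif op == "PUSH":           # assumed opcode, see lead-in
            stack.append(arg)
        elif op == "LOAD":
            stack.append(mem[arg])
        elif op == "STORE":
            mem[arg] = stack.pop()
        elif op == "BR_IF":
            if stack.pop() != 0:
                pc = arg             # backward target gives a loop
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(binary[op](a, b))
    raise RuntimeError("step budget exhausted")

def counting_loop(n, counter=0, acc=1):
    return [
        ("PUSH", n), ("STORE", counter), ("PUSH", 0), ("STORE", acc),
        ("LOAD", acc), ("PUSH", 1), ("ADD", None), ("STORE", acc),   # index 4: loop head
        ("LOAD", counter), ("PUSH", 1), ("SUB", None), ("STORE", counter),
        ("LOAD", counter), ("BR_IF", 4), ("HALT", None),
    ]

loop_results = [run(counting_loop(n))[1][1] for n in (1, 3, 5)]
chained = run([("PUSH", 5), ("PUSH", 6), ("MUL", None),
               ("PUSH", 2), ("SUB", None), ("HALT", None)])[0][-1]
bitwise = run([("PUSH", 12), ("PUSH", 10), ("AND", None), ("HALT", None)])[0][-1]
```

Under these assumptions the loop leaves acc = N for N = 1, 3, 5, the chained expression gives 28, and 12 AND 10 gives 8, matching the cases listed above.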
The OCaml realization of the reference machine is being transpiled to Sutra by an OCaml→Sutra frontend; the substrate primitives the machine needs (RAM-backed arrays, a substrate bitwise stdlib) are in place and individually verified.
5. What we are not claiming
- We do not claim a working DNC. We measured the reduction target for its attention and built a Turing-complete NTM-style machine on the substrate; we have not trained or assembled a differentiable neural computer.
- We do not claim the full 35-opcode transformer-vm runs on the Sutra substrate. The substrate machine implements 11 opcodes and demonstrates the mechanism (memory, dispatch, loops); the remaining opcodes are breadth, and the reference's multi-megabyte linear memory exceeds the current host RAM device.
- We do not claim PCA reduces the transformer. The measured result is the opposite: magnitude-PCA is misleading here; the reducible structure is the two zero attention layers, the ~3-dimensional vocabulary embedding, and the 42/133 genuinely used heads, the last obtainable only from the schedule.
- Throughput and replication figures are quoted from the replication measurements, not from the original authors; where they differ (~18K measured vs ~30K tok/s reported) we report the measured value.
- We did not reproduce the unrelated arXiv "Neural Computers" paper
(2604.06425); it is cited as related work only, and the artifact under study is
Percepta's transformer-vm.
Reproducibility
The analysis and the substrate machine are reproducible from the project repository:
the PCA/SVD and head-usage scripts (experiments/wasm_transformer_pca/), the
substrate machine and its regression test (experiments/iso5_substrate_dispatch/,
sdk/sutra-compiler/tests/test_mini_wasm_machine.py), and the replication of
transformer-vm (the WASM/ subtree, with the authors' code as a submodule).
Repository: https://github.com/EmmaLeonhart/Sutra
References
- A. Graves, G. Wayne, I. Danihelka. Neural Turing Machines. arXiv:1410.5401, 2014.
- A. Graves, G. Wayne, M. Reynolds, et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 2016 (the Differentiable Neural Computer).
- M. Zhuge, C. Zhao, H. Liu, et al. Neural Computers. arXiv:2604.06425, 2026 (related work; not the artifact studied here).
- Percepta-Core. transformer-vm / "Can LLMs Be Computers?" (code + blog; no arXiv). https://github.com/Percepta-Core/transformer-vm