Blog posts

Which Layer Runs the Program?
Anthropic recently showed that a transformer’s computation is organized by depth, with a middle-layer “workspace.” On a tiny transformer whose correct layer layout we know exactly, we identify what places computation at a given depth: normalization, a knob you can turn to relocate a learned step (last-block share 0.88 without it, 0.38 with it, and the effect survives a last-block-removal test). A ground-truthed, causal complement to the “where computation lives” question. In collaboration with Microsoft Research.
Can We Train a Computer? Two Ways to Point at a Memory
We wired a transformer, by hand, to be a Turing-complete computer, then trained another from scratch to do the identical job. Both run it perfectly, yet to every weight-space test they look like different algorithms. They aren’t: gradient descent found a more perpendicular way to point at memory (K_off 0.89 vs 0.20–0.53), and that magnitude gap is what fools the standard test. A ground-truthed case for mechanistic probes over weight-space interpolation. In collaboration with Microsoft Research.
Which Valley, and How Deep: Training Neural Atomic Relaxation at a Fraction of the Memory
A structure relaxation does two separable jobs: pick the right energy minimum, then settle to its bottom. Splitting them matches full backprop on silicon at 3.5× less memory and reaches more correct minima, with a checkable map of exactly when the trick helps. From the OptimaLab (Rice CS); one wrapper over ADAPT, eSEN-OC25 & GemNet-OC.
- 📏 What Counts as a Relaxed Structure?
  The companion question: ML potentials have no agreed metric for relaxation “success.” A sober map of the three metric families, why the global mean misleads (mean/median ≈ 1.3× on our own data), and a literature review crediting the six benchmarks that already built this conversation.
Two Ways to Slim a Model: Remember vs Recompute
Which pruning rule is safe depends on whether a model part remembers (be careful, output-aware) or recomputes (be cheap, magnitude-based). Reconciles GHOST and Sakana’s activation-sparsity result; a capability test (with confidence intervals) shows where the cheap rule silently breaks. 5 models × 3 datasets, forward-only. Building on Michael Menezes’s GHOST (Rice CS).
One Rank at a Time: Cascading Error Dynamics in Sequential Learning
When models are built up one piece at a time (LoRA, deflation PCA, OMP), per-step numerical errors compound geometrically through every later step, and the amplification is governed by the data's spectral gaps. Closed-form bound, schedule prescription (more-first, α≈1.5), validated on synthetic data and on LoRA for vision and language tasks. arXiv:2505.22602, TMLR 2026. With Mahtab Alizadeh Vandchali and Fangshuo (Jasper) Liao (Rice CS).
One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer
A minimal architectural asymmetry (the input enters one update but not the other) is enough to make a shared-weight recurrent Transformer behave like two. Matches the two-network HRM baseline with half the parameters on Sudoku-Extreme (60.0% vs 55.0%) and Maze-30×30 (75.6% vs 74.5%). arXiv:2605.17811. With Jucheng Shen and Barbara Su (Rice CS).
From PCA to LoRA: Why Fine-Tuning Could Have Been Parallel All Along
A 1933 deflation convention let rank-1 errors compound in LoRA fine-tuning. AdaPaD does it in parallel, and the errors provably correct themselves. Best GLUE average (89.34) at a matched 0.34M parameter budget; 3.62× per-batch speedup on 4 H200 GPUs. arXiv:2605.10741. With Barbara Su and Fangshuo (Jasper) Liao (Rice CS).
GHOST: pruning Mamba2 by what each channel does
A forward-only state pruner for Mamba2 selective SSMs: controllability × observability, two forward passes, ~15 GB peak VRAM. ICML 2026. With Michael Menezes (Rice CS).
How a little Gaussian dust changes how a network learns
Multiply every input by random noise. Training still converges, but to a target whose distance from the global minimum we can write down. With Afroditi Kolomvaki, Fangshuo (Jasper) Liao, Evan Dramko, and Ziyun Guang (Rice CS).
Why Stochastic Gradient Descent Stops Just Short of the Edge
A closed-form sharpness gap explains a long-observed property of mini-batch training. With Fangshuo (Jasper) Liao and Afroditi Kolomvaki (Rice CS).
Provable Acceleration of Nesterov's Momentum for Deep ReLU Networks
A new objective class that makes Nesterov provably accelerated for non-trivial neural architectures. With Fangshuo (Jasper) Liao (Rice CS).
Provable Model-Parallel Distributed Principal Component Analysis with Parallel Deflation
A self-correcting parallel deflation scheme for distributed PCA, with convergence guarantees. With Fangshuo (Jasper) Liao and Wenyi Su (Rice CS).