Blog posts
-
One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer
A minimal architectural asymmetry (the input enters one update but not the other) is enough to make a shared-weight recurrent Transformer behave like two networks. It matches the two-network HRM baseline with half the parameters on Sudoku-Extreme (60.0% vs 55.0%) and Maze-30×30 (75.6% vs 74.5%). arXiv:2605.17811.
With Jucheng Shen and Barbara Su (Rice CS).
-
From PCA to LoRA: Why Fine-Tuning Could Have Been Parallel All Along
A 1933 deflation convention let rank-1 errors compound in LoRA fine-tuning. AdaPaD runs the deflation in parallel, and the errors provably correct themselves. Best GLUE average (89.34) at a matched 0.34M-parameter budget; 3.62× per-batch speedup on 4 H200 GPUs. arXiv:2605.10741.
With Barbara Su and Fangshuo (Jasper) Liao (Rice CS).
-
GHOST: pruning Mamba2 by what each channel does
A forward-only state pruner for Mamba2 selective SSMs — controllability × observability, two forward passes, ~15 GB peak VRAM. ICML 2026.
With Michael Menezes (Rice CS).
-
How a little Gaussian dust changes how a network learns
Multiply every input by random noise. Training still converges — to a target whose distance from the global minimum we can write down.
With Afroditi Kolomvaki, Fangshuo (Jasper) Liao, Evan Dramko, and Ziyun Guang (Rice CS).
-
Why Stochastic Gradient Descent Stops Just Short of the Edge
A closed-form sharpness gap explains a long-observed property of mini-batch training.
With Fangshuo (Jasper) Liao and Afroditi Kolomvaki (Rice CS).
-
Provable Acceleration of Nesterov's Momentum for Deep ReLU Networks
A new objective class that makes Nesterov provably accelerated for non-trivial neural architectures.
With Fangshuo (Jasper) Liao (Rice CS).
-
Provable Model-Parallel Distributed Principal Component Analysis with Parallel Deflation
A self-correcting parallel deflation scheme for distributed PCA, with convergence guarantees.
With Fangshuo (Jasper) Liao and Wenyi Su (Rice CS).