Optimization · Materials · Machine Learning

Which Valley, and How Deep: Training Neural Atomic Relaxation at a Fraction of the Memory

A structure relaxation quietly does two jobs at once — choosing which minimum to fall into, and settling precisely to its bottom. Treated as one problem, they force a false choice between accuracy and memory. Pulled apart, the choice disappears.

Yifei Zhang, Evan Dramko & Anastasios Kyrillidis · Rice University · July 2026


In brief

Relaxing an atomic structure to its nearest energy minimum is the inner loop of materials discovery. Doing it with a neural force field means iterating a map until the atoms stop moving — and the two ways to train that map trade off against each other: one is accurate but its memory grows with every step, the other is memory-flat but can polish the wrong answer. We show the two jobs inside a relaxation are separable. On silicon defects the split matches full backprop at 3.5× less memory and reaches more correct minima. Just as usefully, it comes with a checkable map of when the split helps — and when a good potential already does the job alone.

1. The expensive inner loop

Almost every pipeline that searches for new materials keeps asking the same small question: given a guessed arrangement of atoms, what nearby arrangement is actually stable — the one where the net force on every atom is zero? Physicists call this a relaxation. Done with the gold-standard quantum method it is agonizingly slow, so the field increasingly hands the forces to a neural network — a machine-learned interatomic potential. Each step gets cheap. But the relaxation is still a loop: nudge the atoms along the predicted forces, recompute, repeat, until nothing moves.

Written down, that loop is a fixed-point map: a learned operator \(G_\theta\) applied over and over, \( \mathbf{X}_{t+1} = G_\theta(\mathbf{X}_t) = \mathbf{X}_t + \alpha\,F_\theta(\mathbf{X}_t) \), run until it reaches a structure \( \mathbf{X}^\star = G_\theta(\mathbf{X}^\star) \) that the potential declares force-free. Picture an energy landscape of hills and valleys. The relaxation is a ball rolling downhill; where it stops depends entirely on which valley it was over to begin with. That is the whole subtlety. A relaxation is doing two different things at once:

Two jobs, one loop. Basin selection decides which valley the relaxation falls into — a choice made early, while the ball is still near the ridge. Equilibrium precision is the separate job of settling exactly to the chosen valley's floor. Confuse the two and you inherit the worst of each.
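To make the picture concrete, here is a minimal sketch of the loop on a one-dimensional toy landscape. Everything in it is an assumption made for illustration: the double-well energy \(E(x) = (x^2 - 1)^2\), the hand-written `force` standing in for a learned \(F_\theta\), and the step size `alpha`. It shows only that the same update rule settles into different valleys depending on which side of the ridge it starts.

```python
def force(x):
    """Toy force F(x) = -dE/dx for the double-well energy E(x) = (x**2 - 1)**2.

    Hypothetical stand-in for a learned potential; minima sit at x = -1 and x = +1,
    with a ridge at x = 0.
    """
    return -4.0 * x * (x * x - 1.0)


def relax(x0, alpha=0.05, tol=1e-10, max_steps=10_000):
    """Iterate x <- x + alpha * F(x) until the force is (numerically) zero."""
    x = x0
    for step in range(max_steps):
        f = force(x)
        if abs(f) < tol:
            return x, step
        x = x + alpha * f
    return x, max_steps


# Two starts on opposite sides of the ridge end in different valleys.
for x0 in (0.3, -0.3):
    x_star, steps = relax(x0)
    print(f"start {x0:+.1f} -> minimum {x_star:+.4f} after {steps} steps")
```

Which valley the ball ends in is fixed by the sign of the start; the remaining iterations only tighten the answer. That is the basin-selection versus equilibrium-precision split in miniature.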

2. Two ways to train, each half a solution

To make the relaxation land on the right structure, you train the potential through the loop. There are two standard ways, and they sit at opposite corners of a trade-off.

Backprop-through-time (BPTT) differentiates the entire rollout. It sees how an early nudge changes the final structure, so it can steer the ball toward the right valley. The catch: it must remember every intermediate step, so its memory grows with the length of the relaxation.

And here is the part that's easy to miss — the reason memory is the whole fight. A “step” is not a cheap nudge of the atoms. Computing the forces means running a whole neural network: for these models, a graph network that passes messages between every pair of nearby atoms, hundreds of atoms at a time. To differentiate through one step you must keep that entire computation in memory — all of its internal activations. Now multiply by the length of the relaxation: hundreds of steps, each hoarding a full network's worth of activations. A single force call might be a few hundred megabytes; a few hundred of them stacked up is tens of gigabytes, and the GPU is out of room. That is why a training method whose memory doesn't grow with the number of steps is worth caring about.
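The arithmetic behind that paragraph fits in a few lines. This is a rough sketch under a simplifying assumption: every step stores the same amount of activations, and weights, optimizer state, and framework overhead are ignored. The 200 MB per step is a hypothetical figure chosen inside the "few hundred megabytes" range above, not a measurement.

```python
def bptt_stored_mb(per_step_mb, num_steps):
    """Activation memory if every step of the rollout is kept for backprop.

    Simplifying assumption: each step stores the same amount, and weights,
    optimizer state, and framework overhead are ignored.
    """
    return per_step_mb * num_steps


PER_STEP_MB = 200  # hypothetical size of one force call's activations

for num_steps in (50, 100, 300):
    total_mb = bptt_stored_mb(PER_STEP_MB, num_steps)
    print(f"{num_steps:>3} steps -> {total_mb / 1000:.0f} GB")
```

Under these assumptions the bill grows linearly with the length of the relaxation, and a 300-step rollout already asks for 60 GB.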

Implicit differentiation (the implicit-function theorem, IFT) ignores the path and works only from the final force-free condition. Its memory is flat — constant, no matter how long the relaxation — which is exactly the appeal of deep equilibrium models. But it optimizes whichever valley the ball already landed in. From a bad start it will faithfully, efficiently polish the wrong minimum, and it can never tell you how to reach a better one.
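For readers who want the formula behind "works only from the final force-free condition", the textbook implicit-function-theorem calculation is sketched below. It assumes the Jacobian \(\partial F_\theta / \partial \mathbf{X}\) is invertible at \(\mathbf{X}^\star\), which the text above does not state and which can fail in practice. Since \(\mathbf{X}^\star = \mathbf{X}^\star + \alpha\,F_\theta(\mathbf{X}^\star)\) with \(\alpha \neq 0\), the fixed point satisfies \(F_\theta(\mathbf{X}^\star) = 0\), and differentiating that condition with respect to \(\theta\) gives

\[
\frac{\partial F_\theta}{\partial \mathbf{X}}\bigg|_{\mathbf{X}^\star}\,\frac{d\mathbf{X}^\star}{d\theta}
+ \frac{\partial F_\theta}{\partial \theta}\bigg|_{\mathbf{X}^\star} = 0
\quad\Longrightarrow\quad
\frac{d\mathbf{X}^\star}{d\theta}
= -\left(\frac{\partial F_\theta}{\partial \mathbf{X}}\bigg|_{\mathbf{X}^\star}\right)^{-1}
\frac{\partial F_\theta}{\partial \theta}\bigg|_{\mathbf{X}^\star}.
\]

Every term is evaluated at \(\mathbf{X}^\star\) alone. No intermediate iterate appears, which is why the memory is flat, and also why the gradient carries no information about any other valley.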

The bigger idea

The trade-off looks fundamental only because both methods are asked to do both jobs. But the two jobs have opposite needs. Choosing the valley depends on the path and is decided early — it wants a short, differentiated window. Settling to the floor is a fixed-point problem — it wants constant memory and doesn't care about the path at all. So give each job the tool built for it.

3. The split: guidance, then equilibrium

The method is almost embarrassingly literal. Run a short differentiated window at the start — just the first \(K\) steps — and use it for one job only: picking the valley. Then hand off to the implicit phase, which drives the relaxation to its exact fixed point at constant memory. The differentiated part comes first (that is when the valley is still in play); the memory-flat part comes last (that is when only precision is left). It is the reverse of how unrolled steps are usually bolted onto implicit models — and the order follows directly from which job each phase is for.

Readers coming from machine learning will recognize the move. Deep-equilibrium models have a trick called the phantom gradient: run a few unrolled steps as a cheap stand-in for the exact implicit gradient. We borrow the very same primitive — a few unrolled steps — but flip its purpose. We are not approximating the equilibrium gradient (we compute that one exactly); we are using the unrolled window to choose the basin. Same tool, opposite job. It is a small but real bridge between how the deep-learning world and the materials-simulation world think about the same loop — and a hint that primitives invented for one can be repurposed in the other.

Differentiate first, then go implicit. Only the short guidance window is unrolled and stored, so it — and it alone — sets the memory bill. The horizon \(K\) is a dial: turn it down for less memory, up for more steering. The long equilibrium tail costs the same memory whether it runs for ten steps or ten thousand.

Because only the guidance window is stored, memory becomes a knob you set, not a cost you pay. The horizon \(K\) trades memory for steering along a smooth frontier — and, crucially, it is decoupled from how long the relaxation runs.
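A minimal sketch of the forward pass makes the storage pattern visible. It is a toy under stated assumptions, not the released implementation: `force` is a hypothetical one-parameter tilted double well standing in for \(F_\theta\), the names `split_rollout` and `horizon_k` are invented here, and the sketch covers only what is kept in memory, not how the two gradients are combined during training.

```python
def force(x, theta):
    """Toy stand-in for a learned force: a double well tilted by theta (hypothetical)."""
    return -4.0 * x * (x * x - 1.0) + theta


def split_rollout(x0, theta, horizon_k, alpha=0.05, tol=1e-10, max_steps=10_000):
    """Run K stored guidance steps, then an unstored tail to the fixed point."""
    # Guidance window: every state is kept, as a differentiated rollout would need.
    stored = [x0]
    x = x0
    for _ in range(horizon_k):
        x = x + alpha * force(x, theta)
        stored.append(x)

    # Equilibrium tail: the state is overwritten in place, so nothing accumulates.
    tail_steps = 0
    f = force(x, theta)
    while abs(f) > tol and tail_steps < max_steps:
        x = x + alpha * f
        f = force(x, theta)
        tail_steps += 1
    return x, stored, tail_steps


# Tightening the tolerance lengthens the tail but not the list of stored states.
for tol in (1e-6, 1e-12):
    x_star, stored, tail_steps = split_rollout(0.3, 0.0, horizon_k=5, tol=tol)
    print(f"tol={tol:g}: stored {len(stored)} states, tail ran {tail_steps} steps")
```

The number of stored states is `horizon_k + 1` regardless of how long the tail runs, which is the decoupling the paragraph above describes.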

**Memory is a dial.** Silicon point-defect relaxation, peak training memory vs. the guidance horizon \(K\), against full BPTT (7,470 MB). Accuracy (ΔQ, lower is better) is a statistical tie with BPTT for \(K \ge 15\), where the reported error bars overlap; 5 seeds.
Guidance horizon	Peak memory	vs. full BPTT	Accuracy (ΔQ)
K = 10	1,261 MB	5.9× less	277.0 ± 6.4
K = 15	1,755 MB	4.3× less	254.5 ± 9.8
K = 20 (sweet spot)	2,145 MB	3.5× less	248.3 ± 9.0
K = 30	3,230 MB	2.3× less	251.6 ± 5.9
full BPTT	7,470 MB	1.0× (baseline)	246.0 ± 5.8
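The "vs. full BPTT" column can be recomputed from the peak-memory column alone. The short check below uses only the numbers in the table; the variable and function names are ours.

```python
BPTT_MB = 7470  # full BPTT peak memory, from the table
PEAK_MB = {10: 1261, 15: 1755, 20: 2145, 30: 3230}  # guidance horizon K -> peak MB


def savings_factor(peak_mb, baseline_mb=BPTT_MB):
    """How many times less memory than full BPTT, rounded to one decimal."""
    return round(baseline_mb / peak_mb, 1)


factors = {k: savings_factor(mb) for k, mb in PEAK_MB.items()}
print(factors)  # {10: 5.9, 15: 4.3, 20: 3.5, 30: 2.3}
```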

4. The result that matters: it selects better, not just cheaper

Saving memory would be a nice engineering win on its own. But the sharper claim is that the split reaches the right valleys — and to earn that claim you have to count basins, not bytes. On a set of 100 silicon point defects we asked, of each method: how many relaxations landed in the correct minimum?

Correct basins, out of 100. The pretrained potential alone lands right 43 times; polishing it with the path-blind implicit gradient makes things worse (25) — it just sharpens whichever minimum it started in. Full BPTT, which sees the path, reaches 79. The decomposition reaches 85, matching or beating BPTT while spending a fraction of its memory. The advantage is selection, and that is the point.

The reading is clean. The implicit gradient, blind to the path, is actively counter-productive for choosing a valley — exactly as the picture predicts. The short guidance window recovers all of BPTT's valley-picking power, and then some, for a fraction of the memory. Accuracy and efficiency turned out not to be opposed; they were two questions wearing one coat.

5. When does the trick help? A map, not a promise

Here is where it would be easy to oversell. A method that wins on silicon is not a method that wins everywhere, and the honest contribution is knowing the difference in advance. The guidance window earns its keep only when it has a real job to do: a wrong valley to fix, decided early enough that a short window can reach the decision. That breaks down into four checkable conditions: the potential gets the forces right, the landscape has many valleys, the base potential lands in the wrong one, and the relaxation is short.

**The regime map.** The split has headroom only when all four conditions hold. Each “no” is a *coordinate* — a specific condition that is off — not a mysterious failure. We tested the first three rows; the fourth is the open frontier.
Setting (base potential)	Forces right?	Many valleys?	Base lands wrong?	Short relaxation?	Headroom?
Silicon defects (weak, narrow potential)	yes	yes	yes	yes	yes
Titanium adsorbate (strong, in-domain potential)	yes	yes	no	yes	no
Oxide surface, OC22 (far-from-equilibrium)	yes	yes	yes	no	no
Bulk crystals (no in-domain potential yet)	?	yes	likely	?	open

Read as a map, each row teaches something. On titanium, a strong, well-trained potential already lands in the right valley — so there is nothing to re-select, and the split has no job. That is not a defeat; it is the map correctly predicting “a good potential already solves this.” On oxide surfaces, the base really does land in wrong valleys — but the relaxations are roughly seven times longer, so the valley-deciding moment falls outside any short window, and a short window is the whole point. And bulk crystals — the most diverse setting of all — remain open, waiting on a potential trained in-domain. One thing holds across every row: the memory property is unconditional. It is only the selection value that depends on where you are on the map.
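Because the map is a conjunction of four yes/no answers, it can be written down as a small lookup. The sketch below transcribes the table; the encoding is our assumption, in particular recording both "?" and "likely" as `None` (untested), and the function name `headroom` is invented for illustration.

```python
CONDITIONS = ("forces right", "many valleys", "base lands wrong", "short relaxation")

# True = yes, False = no, None = not yet established ("?" or "likely" in the table).
REGIME_MAP = {
    "Silicon defects": (True, True, True, True),
    "Titanium adsorbate": (True, True, False, True),
    "Oxide surface (OC22)": (True, True, True, False),
    "Bulk crystals": (None, True, None, None),
}


def headroom(answers):
    """Return the verdict plus the conditions responsible for it."""
    failed = [name for name, ok in zip(CONDITIONS, answers) if ok is False]
    if failed:
        return "no", failed
    unknown = [name for name, ok in zip(CONDITIONS, answers) if ok is None]
    if unknown:
        return "open", unknown
    return "yes", []


for setting, answers in REGIME_MAP.items():
    verdict, reasons = headroom(answers)
    detail = f" ({', '.join(reasons)})" if reasons else ""
    print(f"{setting}: {verdict}{detail}")
```

The second return value is the "coordinate": for titanium it names the base potential already landing right, for the oxide surface it names the relaxation length.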

6. A word for both sides of the aisle

This problem lives on a fault line between two communities that name the same things differently, and a surprising amount of the difficulty is translation. To a machine-learning reader a “basin” is obvious and a “slab” is jargon; to a materials scientist it is the reverse. So here is the small bilingual dictionary the whole idea rests on:

**The same objects, two vocabularies.** Left, materials-science terms for a machine-learning reader; right, machine-learning terms for a materials scientist.
Materials term	— is —	ML / optimization term	— is —
relaxation	rolling downhill to a force-free structure	basin selection	choosing which minimum you land in
basin / metastable state	one valley the ball can settle in	equilibrium precision	how exactly you reach that valley's floor
slab & adsorbate	a surface, and a molecule stuck to it	BPTT	differentiate the whole rollout (memory grows)
cell	the periodic box; fixed or free to change shape	fixed point / IFT	the force-free condition; its gradient at flat memory
relaxation trajectory	the path \( \mathbf{X}_0,\mathbf{X}_1,\dots\to\mathbf{X}^\star \)	guidance horizon \(K\)	how many steps you differentiate — the dial

Spelling it out is not throat-clearing; it is a small contribution in its own right. Half the reason “why not just use dataset X?” is hard to answer across the aisle is that the datasets look identical through one lens and completely different through the other. The map above is really a claim that “relaxation” is several different problems wearing one word — and that trajectory-level training is worth its cost on only some of them.

How we checked

The silicon numbers are five seeds each, held-out, with matched learning rate and training budget against a single shared base potential. The implicit and unrolled gradients were verified against finite differences in double precision, per backbone; the checkpointed guidance gradient is bit-identical to the naive one. The same wrapper was exercised across three different neural potentials (a cell-blind Transformer, an equivariant cell-aware model, a message-passing network) to show the method isn't wedded to one architecture. The beyond-silicon rows were run as pre-registered tests with kill conditions fixed in advance — which is why we can report the titanium and oxide outcomes as map coordinates rather than quietly reframing them after the fact.
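For a sense of what a finite-difference check of an implicit gradient looks like, here is a toy version. It is not the authors' test code: the scalar `force` with a single tilt parameter `theta` is a hypothetical stand-in for a neural potential, and the tolerances are chosen for this toy only. Python floats are double precision, which matches the setting described above.

```python
def force(x, theta):
    """Toy stand-in for a learned force: a double well tilted by theta (hypothetical)."""
    return -4.0 * x * (x * x - 1.0) + theta


def relax(x0, theta, alpha=0.05, tol=1e-13, max_steps=100_000):
    """Iterate x <- x + alpha * F(x, theta) to the force-free point."""
    x = x0
    for _ in range(max_steps):
        f = force(x, theta)
        if abs(f) < tol:
            break
        x = x + alpha * f
    return x


def implicit_grad(x_star):
    """dx*/dtheta from F(x*, theta) = 0, using only the final state."""
    dF_dx = -12.0 * x_star * x_star + 4.0  # derivative of the toy force in x
    dF_dtheta = 1.0                        # the tilt enters additively
    return -dF_dtheta / dF_dx


def finite_difference_grad(x0, theta, h=1e-5):
    """Central difference through two full relaxations."""
    return (relax(x0, theta + h) - relax(x0, theta - h)) / (2.0 * h)


x0, theta = 0.3, 0.0
ift = implicit_grad(relax(x0, theta))
fd = finite_difference_grad(x0, theta)
print(f"implicit: {ift:.8f}   finite difference: {fd:.8f}")
```

The implicit value needs one relaxation and no stored trajectory; the finite-difference value needs two full relaxations per parameter, which is why it serves as a check rather than a training method.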

What we don't claim

One demonstrated win, on silicon point defects, plus a map that says where else to expect one. We do not claim a second dataset win: on the settings we probed, a strong potential already solved the task (titanium) or the relaxation was too long for a short window (oxide). The guarantees that come with warm-start theory do not transfer to a learned, non-contractive operator, so we keep that framing informal. The bulk-crystal frontier is open, not conquered.

Why it matters

Strip away the vocabulary and one sentence survives: a relaxation is really asking two questions — which valley, and how deep — and you should answer them with two different tools. Answer them together and you overpay in memory for accuracy you could have had for free. Answer them apart and the dilemma dissolves. The rest — the 3.5×, the 85 out of 100, the four-column map — is what that one idea looks like when you hold it to the light and insist on checking.

Silicon results are five seeds, held-out, matched-budget; gradients verified against double-precision finite differences per backbone. Full write-up, the regime-of-validity analysis, and the machine-learning ↔ materials-science dataset landscape are in the report; code and reproduction at github.com/akyrillidis/G-DEQ.
