A structure relaxation quietly does two jobs at once — choosing which minimum to fall into, and settling precisely to its bottom. Treated as one problem, they force a false choice between accuracy and memory. Pulled apart, the choice disappears.
Relaxing an atomic structure to its nearest energy minimum is the inner loop of materials discovery. Doing it with a neural force field means iterating a map until the atoms stop moving — and the two ways to train that map trade off against each other: one is accurate but its memory grows with every step, the other is memory-flat but can polish the wrong answer. We show the two jobs inside a relaxation are separable. On silicon defects the split matches full backprop at 3.5× less memory and reaches more correct minima. Just as usefully, it comes with a checkable map of when the split helps — and when a good potential already does the job alone.
Almost every pipeline that searches for new materials keeps asking the same small question: given a guessed arrangement of atoms, what nearby arrangement is actually stable — the one where the net force on every atom is zero? Physicists call this a relaxation. Done with the gold-standard quantum method it is agonizingly slow, so the field increasingly hands the forces to a neural network — a machine-learned interatomic potential. Each step gets cheap. But the relaxation is still a loop: nudge the atoms along the predicted forces, recompute, repeat, until nothing moves.
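The loop just described can be sketched on a toy problem. This is a minimal illustration, not the post's code: the one-dimensional double-well energy, the step size `alpha`, the tolerance, and the function names are all assumptions chosen for the example, and a real neural potential would replace `force`.

```python
def force(x):
    """Toy force F = -dE/dx for the double-well energy E(x) = (x**2 - 1)**2.

    Stand-in (assumption) for a learned potential's predicted force.
    """
    return -4.0 * x * (x * x - 1.0)


def relax(x0, alpha=0.05, tol=1e-10, max_steps=10_000):
    """Nudge x along the force, recompute, repeat, until nothing moves."""
    x = x0
    for step in range(max_steps):
        f = force(x)
        if abs(f) < tol:          # force-free: the relaxation has converged
            return x, step
        x = x + alpha * f         # one relaxation step
    return x, max_steps


# Two nearby starting guesses on opposite sides of the barrier at x = 0.
left, _ = relax(-0.3)
right, _ = relax(0.3)
print(left, right)
```

The two starts differ by only 0.6, yet they settle in different minima (near -1 and near +1), because the barrier between them decides the outcome in the first few steps.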
Written down, that loop is a fixed-point map: a learned operator \(G_\theta\) applied over and over, \( \mathbf{X}_{t+1} = G_\theta(\mathbf{X}_t) = \mathbf{X}_t + \alpha\,F_\theta(\mathbf{X}_t) \), where \(F_\theta\) is the predicted force and \(\alpha\) a step size, run until it reaches a structure \( \mathbf{X}^\star = G_\theta(\mathbf{X}^\star) \) that the potential declares force-free. Picture an energy landscape of hills and valleys. The relaxation is a ball rolling downhill; where it stops depends entirely on which valley it started above. That is the whole subtlety. A relaxation is doing two different things at once: choosing which valley to fall into (basin selection), and settling precisely to that valley's floor (equilibrium precision).
To make the relaxation land on the right structure, you train the potential through the loop. There are two standard ways, and they sit at opposite corners of a trade-off.
Backprop-through-time (BPTT) differentiates the entire rollout. It sees how an early nudge changes the final structure, so it can genuinely steer the ball toward the right valley. The catch: it must remember every intermediate step, so its memory grows with the length of the relaxation.
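To make the memory cost concrete, here is a hand-written reverse-mode pass through a toy rollout. It is a sketch under stated assumptions: the scalar double-well force with a single tilt parameter `theta`, the squared-error loss, and all function names are hypothetical, and a real implementation would rely on an autodiff framework rather than manual derivatives.

```python
def force(x, theta):
    """Toy force for E(x; theta) = (x**2 - 1)**2 + theta * x (an assumption)."""
    return -4.0 * x * (x * x - 1.0) - theta


def dforce_dx(x):
    """Derivative of the toy force with respect to x."""
    return 4.0 - 12.0 * x * x


def rollout(x0, theta, alpha, steps):
    """Forward pass that keeps every intermediate state, as BPTT requires."""
    states = [x0]
    for _ in range(steps):
        states.append(states[-1] + alpha * force(states[-1], theta))
    return states


def bptt_grad(x0, theta, target, alpha=0.05, steps=40):
    """Gradient of (x_T - target)**2 w.r.t. theta, through the whole rollout."""
    states = rollout(x0, theta, alpha, steps)
    adjoint = 2.0 * (states[-1] - target)        # dL/dx_T
    grad = 0.0
    for x in reversed(states[:-1]):              # walk back: x_{T-1}, ..., x_0
        grad += adjoint * alpha * (-1.0)         # direct path: dF/dtheta = -1
        adjoint *= 1.0 + alpha * dforce_dx(x)    # path through x_{t+1} = G(x_t)
    return grad, len(states)


grad, n_stored = bptt_grad(x0=0.3, theta=0.1, target=0.9)
print(grad, n_stored)
```

The backward loop needs every entry of `states`, so the stored list grows by one entry per relaxation step. In this toy each entry is a single float; in the real setting each entry stands for a full network's activations.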
And here is the part that's easy to miss — the reason memory is the whole fight. A “step” is not a cheap nudge of the atoms. Computing the forces means running a whole neural network: for these models, a graph network that passes messages between every pair of nearby atoms, hundreds of atoms at a time. To differentiate through one step you must keep that entire computation in memory — all of its internal activations. Now multiply by the length of the relaxation: hundreds of steps, each hoarding a full network's worth of activations. A single force call might be a few hundred megabytes; a few hundred of them stacked up is tens of gigabytes, and the GPU is simply out of room. That is why a training method whose memory doesn't grow with the number of steps is worth caring about.
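The arithmetic behind that estimate is a single multiplication. The figures below are illustrative placeholders for "a few hundred megabytes" and "a few hundred steps", not measurements from the post, and the function name is hypothetical.

```python
def bptt_activation_gb(steps, mb_per_force_call):
    """Activation memory if every step's force call is kept for backprop.

    Uses decimal units (1 GB = 1000 MB) to keep the arithmetic readable.
    """
    return steps * mb_per_force_call / 1000.0


# Illustrative values only (assumptions, not measurements).
for steps in (50, 200, 400):
    print(steps, "steps ->", bptt_activation_gb(steps, mb_per_force_call=300), "GB")
```

At 300 MB per call, 200 stored steps already come to 60 GB, which is the "tens of gigabytes" regime described above.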
Implicit differentiation (the implicit-function theorem, IFT) ignores the path and works only from the final force-free condition. Its memory is flat — constant, no matter how long the relaxation — which is exactly the appeal of deep equilibrium models. But it optimizes whichever valley the ball already landed in. From a bad start it will faithfully, efficiently polish the wrong minimum, and it can never tell you how to reach a better one.
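The flat memory follows from a short derivation, sketched here under the assumption that \(G_\theta\) is differentiable and that \(I - \partial G_\theta/\partial \mathbf{X}\) is invertible at \(\mathbf{X}^\star\) (the post does not state these conditions explicitly). Differentiating the fixed-point condition \( \mathbf{X}^\star = G_\theta(\mathbf{X}^\star) \) with respect to \(\theta\) gives

\[
\frac{d\mathbf{X}^\star}{d\theta}
= \frac{\partial G_\theta}{\partial \mathbf{X}}\,\frac{d\mathbf{X}^\star}{d\theta}
+ \frac{\partial G_\theta}{\partial \theta}
\quad\Longrightarrow\quad
\frac{d\mathbf{X}^\star}{d\theta}
= \left(I - \frac{\partial G_\theta}{\partial \mathbf{X}}\right)^{-1}
\frac{\partial G_\theta}{\partial \theta},
\]

with every derivative evaluated at \(\mathbf{X}^\star\). Substituting \( G_\theta(\mathbf{X}) = \mathbf{X} + \alpha\,F_\theta(\mathbf{X}) \), the step size cancels:

\[
\frac{d\mathbf{X}^\star}{d\theta}
= -\left(\frac{\partial F_\theta}{\partial \mathbf{X}}\right)^{-1}
\frac{\partial F_\theta}{\partial \theta}.
\]

Only \(\mathbf{X}^\star\) appears on the right-hand side, so nothing from the trajectory has to be stored. The same formula shows the limitation: it describes how the minimum already reached shifts with \(\theta\), and carries no information about any other minimum.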
The trade-off looks fundamental only because both methods are asked to do both jobs. But the two jobs have opposite needs. Choosing the valley depends on the path and is decided early — it wants a short, differentiated window. Settling to the floor is a fixed-point problem — it wants constant memory and doesn't care about the path at all. So give each job the tool built for it.
The method is almost embarrassingly literal. Run a short differentiated window at the start — just the first \(K\) steps — and use it for one job only: picking the valley. Then hand off to the implicit phase, which drives the relaxation to its exact fixed point at constant memory. The differentiated part comes first (that is when the valley is still in play); the memory-flat part comes last (that is when only precision is left). It is the reverse of how unrolled steps are usually bolted onto implicit models — and the order follows directly from which job each phase is for.
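The forward structure of the two phases can be sketched as follows. This is only an illustration of what gets stored and when: the toy force, the step size, the stopping rule, and the function names are assumptions, and the sketch does not show how the gradients of the two phases are combined, which the post does not spell out.

```python
def force(x):
    """Toy double-well force (assumption), standing in for a learned potential."""
    return -4.0 * x * (x * x - 1.0)


def guided_relaxation(x0, K=20, alpha=0.05, tol=1e-10, max_steps=10_000):
    """Phase 1 stores K steps (the guidance window); phase 2 stores nothing."""
    # Phase 1: the differentiated window. Every state is kept for backprop.
    window = [x0]
    for _ in range(K):
        window.append(window[-1] + alpha * force(window[-1]))

    # Phase 2: iterate to the fixed point, overwriting x in place.
    x = window[-1]
    total_steps = K
    while abs(force(x)) >= tol and total_steps < max_steps:
        x = x + alpha * force(x)
        total_steps += 1
    return x, window, total_steps


x_star, window, total_steps = guided_relaxation(0.3, K=5)
print(len(window), total_steps, x_star)
```

The length of `window` depends only on `K`, while `total_steps` is set by how long the relaxation takes to converge. That separation is what lets the stored memory stay fixed when the relaxation runs longer.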
Readers coming from machine learning will recognize the move. Deep-equilibrium models have a trick called the phantom gradient: run a few unrolled steps as a cheap stand-in for the exact implicit gradient. We borrow the very same primitive — a few unrolled steps — but flip its purpose. We are not approximating the equilibrium gradient (we compute that one exactly); we are using the unrolled window to choose the basin. Same tool, opposite job. It is a small but real bridge between how the deep-learning world and the materials-simulation world think about the same loop — and a hint that primitives invented for one can be repurposed in the other.
Because only the guidance window is stored, memory becomes a knob you set, not a cost you pay. The horizon \(K\) trades memory for steering along a smooth frontier — and, crucially, it is decoupled from how long the relaxation actually runs.
| Guidance horizon | Peak memory | vs. full BPTT | Accuracy (ΔQ, lower is better) |
|---|---|---|---|
| K = 10 | 1,261 MB | 5.9× less | 277.0 ± 6.4 |
| K = 15 | 1,755 MB | 4.3× less | 254.5 ± 9.8 |
| K = 20 (sweet spot) | 2,145 MB | 3.5× less | 248.3 ± 9.0 |
| K = 30 | 3,230 MB | 2.3× less | 251.6 ± 5.9 |
| full BPTT | 7,470 MB | 1.0× (baseline) | 246.0 ± 5.8 |
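The "vs. full BPTT" column can be re-derived from the peak-memory column alone. The short check below uses only the numbers printed in the table; the variable names are arbitrary.

```python
# Peak memory in MB, copied from the table above.
peak_mb = {
    "K=10": 1261,
    "K=15": 1755,
    "K=20": 2145,
    "K=30": 3230,
    "full BPTT": 7470,
}

# Savings factor relative to full BPTT, rounded to one decimal as in the table.
savings = {name: round(peak_mb["full BPTT"] / mb, 1) for name, mb in peak_mb.items()}
print(savings)
```

The rounded ratios reproduce the table's 5.9×, 4.3×, 3.5×, and 2.3×.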
Saving memory would be a nice engineering win on its own. But the sharper claim is that the split reaches the right valleys — and to earn that claim you have to count basins, not bytes. On a set of 100 silicon point defects we asked, of each method: how many relaxations landed in the correct minimum?
The reading is clean. The implicit gradient, blind to the path, is actively counter-productive for choosing a valley — exactly as the picture predicts. The short guidance window recovers all of BPTT's valley-picking power, and then some, for a fraction of the memory. Accuracy and efficiency turned out not to be opposed; they were two questions wearing one coat.
Here is where it would be easy to oversell. A method that wins on silicon is not a method that wins everywhere, and the honest contribution is knowing the difference in advance. The guidance window earns its keep only when it has a real job to do: a wrong valley to fix, decided early enough that a short window can reach the decision. That comes down to four checkable conditions, with a final column recording whether the split has any headroom:
| Setting (base potential) | Forces right? | Many valleys? | Base lands wrong? | Short relaxation? | Headroom? |
|---|---|---|---|---|---|
| Silicon defects (weak, narrow potential) | yes | yes | yes | yes | yes |
| Titanium adsorbate (strong, in-domain potential) | yes | yes | no | yes | no |
| Oxide surface, OC22 (far-from-equilibrium) | yes | yes | yes | no | no |
| Bulk crystals (no in-domain potential yet) | ? | yes | likely | ? | open |
Read as a map, each row teaches something. On titanium, a strong, well-trained potential already lands in the right valley — so there is nothing to re-select, and the split has no job. That is not a defeat; it is the map correctly predicting “a good potential already solves this.” On oxide surfaces, the base really does land in wrong valleys — but the relaxations are roughly seven times longer, so the valley-deciding moment falls outside any short window, and a short window is the whole point. And bulk crystals — the most diverse setting of all — remain genuinely open, waiting on a potential trained in-domain. One thing holds across every row: the memory property is unconditional. It is only the selection value that depends on where you are on the map.
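Read as a rule, the map says the split has headroom only when all four conditions hold. The sketch below encodes that reading; the function name, the use of `None` for an unknown or merely "likely" entry, and the row encoding are assumptions made for illustration.

```python
from typing import Optional


def split_has_headroom(
    forces_right: Optional[bool],
    many_valleys: Optional[bool],
    base_lands_wrong: Optional[bool],
    short_relaxation: Optional[bool],
) -> Optional[bool]:
    """True if all four conditions hold, False if any fails, None if still open."""
    conditions = [forces_right, many_valleys, base_lands_wrong, short_relaxation]
    if any(c is False for c in conditions):
        return False              # one failed condition removes the headroom
    if any(c is None for c in conditions):
        return None               # nothing fails, but something is unknown
    return True


# Rows transcribed from the table; "?" and "likely" are encoded as None.
rows = {
    "silicon defects": (True, True, True, True),
    "titanium adsorbate": (True, True, False, True),
    "oxide surface (OC22)": (True, True, True, False),
    "bulk crystals": (None, True, None, None),
}
verdicts = {name: split_has_headroom(*conds) for name, conds in rows.items()}
print(verdicts)
```

Applied to the four rows, the rule returns headroom for silicon, none for titanium and the oxide surface, and an open verdict for bulk crystals, matching the last column of the table.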
This problem lives on a fault line between two communities that name the same things differently, and a surprising amount of the difficulty is translation. To a machine-learning reader a “basin” is obvious and a “slab” is jargon; to a materials scientist it is the reverse. So here is the small bilingual dictionary the whole idea rests on:
| Materials term | — is — | ML / optimization term | — is — |
|---|---|---|---|
| relaxation | rolling downhill to a force-free structure | basin selection | choosing which minimum you land in |
| basin / metastable state | one valley the ball can settle in | equilibrium precision | how exactly you reach that valley's floor |
| slab & adsorbate | a surface, and a molecule stuck to it | BPTT | differentiate the whole rollout (memory grows) |
| cell | the periodic box; fixed or free to change shape | fixed point / IFT | the force-free condition; its gradient at flat memory |
| relaxation trajectory | the path \( \mathbf{X}_0,\mathbf{X}_1,\dots\to\mathbf{X}^\star \) | guidance horizon \(K\) | how many steps you differentiate — the dial |
Spelling it out is not throat-clearing; it is a small contribution in its own right. Half the reason “why not just use dataset X?” is hard to answer across the aisle is that the datasets look identical through one lens and completely different through the other. The map above is really a claim that “relaxation” is several different problems wearing one word — and that trajectory-level training is worth its cost on only some of them.
The silicon numbers are five seeds each, held-out, with matched learning rate and training budget against a single shared base potential. The implicit and unrolled gradients were verified against finite differences in double precision, per backbone; the checkpointed guidance gradient is bit-identical to the naive one. The same wrapper was exercised across three different neural potentials (a cell-blind Transformer, an equivariant cell-aware model, a message-passing network) to show the method isn't wedded to one architecture. The beyond-silicon rows were run as pre-registered tests with kill conditions fixed in advance — which is why we can report the titanium and oxide outcomes as map coordinates rather than quietly reframing them after the fact.
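A finite-difference check of this kind can be sketched on a toy problem. The example compares the implicit-function-theorem gradient of a fixed point against a central difference, in double precision (Python floats). The scalar tilted double-well force, the step sizes, and the function names are assumptions; the repository's actual tests may look quite different.

```python
def force(x, theta):
    """Toy force for E(x; theta) = (x**2 - 1)**2 + theta * x (an assumption)."""
    return -4.0 * x * (x * x - 1.0) - theta


def fixed_point(theta, x0=0.3, alpha=0.05, tol=1e-12, max_steps=100_000):
    """Relax until the force vanishes; returns the converged x*."""
    x = x0
    for _ in range(max_steps):
        f = force(x, theta)
        if abs(f) < tol:
            break
        x += alpha * f
    return x


def implicit_grad(theta):
    """dx*/dtheta = -(dF/dx)^(-1) * dF/dtheta, evaluated at the converged x*."""
    x_star = fixed_point(theta)
    dF_dx = 4.0 - 12.0 * x_star * x_star
    dF_dtheta = -1.0
    return -dF_dtheta / dF_dx


def finite_difference(theta, h=1e-5):
    """Central difference of the converged fixed point with respect to theta."""
    return (fixed_point(theta + h) - fixed_point(theta - h)) / (2.0 * h)


ift = implicit_grad(0.1)
fd = finite_difference(0.1)
print(ift, fd, abs(ift - fd))
```

The finite-difference side has to re-run the relaxation to a tight tolerance for each perturbed parameter, since any leftover error in the fixed point is amplified by the division by `2 * h`.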
The claim is one demonstrated win, on silicon point defects, plus a map that says where else to expect one. We do not claim a second dataset win: on the settings we probed, a strong potential already solved the task (titanium) or the relaxation was too long for a short window (oxide). The guarantees that come with warm-start theory do not transfer to a learned, non-contractive operator, so we keep that framing informal. The bulk-crystal frontier is open, not conquered.
Strip away the vocabulary and one sentence survives: a relaxation is really asking two questions — which valley, and how deep — and you should answer them with two different tools. Answer them together and you overpay in memory for accuracy you could have had for free. Answer them apart and the dilemma dissolves. The rest — the 3.5×, the 85 out of 100, the four-column map — is what that one idea looks like when you hold it to the light and insist on checking.
Silicon results are five seeds, held-out, matched-budget; gradients verified against double-precision finite differences per backbone. Full write-up, the regime-of-validity analysis, and the machine-learning ↔ materials-science dataset landscape are in the report; code and reproduction at github.com/akyrillidis/G-DEQ.