Sequential learning · low-rank regression · error propagation

One Rank at a Time:
Cascading Error Dynamics in Sequential Learning

When models are built up one piece at a time, mistakes in the first piece don't stay there — this paper pins down, mathematically, how they cascade.

Authors: Mahtab Alizadeh Vandchali · Fangshuo (Jasper) Liao · Anastasios Kyrillidis Venue: TMLR (accepted, in press) arXiv:2505.22602 PDF Code (GitHub)

The Upshot ▸ Sequential rank-1 deflation amplifies early numerical error through every later step. The amplification factor is governed by the spectral gaps of the data; when gaps are large the cascade is damped, when gaps are small it can blow up. The result gives a quantitative answer to the practical question: how should you allocate compute across the steps when training one rank at a time?

§01Setup

Many modern training schemes — low-rank adaptation (LoRA) being the canonical example — proceed in stages. You fit a small low-rank update, freeze it, then fit another, and so on. Each subsequent step learns on the residual left by the previous ones. It is convenient. It is widely used. And it is largely uncharacterized: how does a numerical mistake in step 1 leak into step 2, 3, … $r$? (A companion piece, AdaPaD: From PCA to LoRA, asks what changes if you do this same deflation in parallel instead — the present paper is the sequential half of that picture.)

This paper studies that question in the cleanest non-trivial setting where the math goes through. Consider linear low-rank regression:

$$\mathbf{Y} \;\approx\; \mathbf{W} \mathbf{X}, \qquad \mathbf{W} = \mathbf{B}\mathbf{A}, \qquad \mathrm{rank}(\mathbf{W}) = r \ll \min(m, d).$$

Instead of solving for $\mathbf{W}$ jointly, the sequential algorithm peels off the dominant rank-1 component, subtracts its prediction from the target, and repeats:

$$(\mathbf{a}_k, \mathbf{b}_k) \;=\; \arg\min_{\mathbf{a},\,\mathbf{b}} \tfrac{1}{2}\big\| \mathbf{Y}_k - \mathbf{b}\,\mathbf{a}^\top \mathbf{X} \big\|_F^2, \qquad \mathbf{Y}_{k+1} \leftarrow \mathbf{Y}_k - \mathbf{b}_k\,\mathbf{a}_k^\top \mathbf{X}.$$

Two observations make this worth studying. First, sequential rank-1 deflation is structurally identical to PCA-style algorithms, OMP-style greedy methods, and several recent PEFT recipes — a unified analysis pays back across multiple methods. Second, each step uses a finite-precision solver. There is always some per-step residual error $\boldsymbol{\Psi}_k$. The question is whether these errors stay small or compound.

§02The cascade

The paper's central object is the cumulative training error after $r$ rank-1 steps. Theorem 1 bounds it by

$$\Big\| \mathbf{Y} - \sum_{k=1}^{r} \mathbf{b}_k \mathbf{a}_k^\top \mathbf{X} \Big\|_F \;\le\; \underbrace{\Big(\sum_{k \gt r} (\sigma_k^\star)^2 \Big)^{1/2}}_{\text{truncation tail}} \;+\; \underbrace{\sum_{k=1}^{r} \Big(\prod_{j \lt k} \rho_j\Big)\, \big\|\boldsymbol{\Psi}_k\big\|_F}_{\textbf{cascade amplification}}$$

Two terms. The truncation tail is the unavoidable cost of throwing away components past rank $r$ — it is what you would also pay with a perfect oracle SVD. The interesting term is the second: each per-step error $\boldsymbol{\Psi}_k$ is amplified by a product of factors

$$\rho_j \;=\; 2 + 6\,\frac{\sigma_j^\star}{\mathcal{T}_j^\star},$$

where $\sigma_j^\star$ is the $j$-th singular value of the output matrix $\mathbf{Y}$ and $\mathcal{T}_j^\star = \sigma_j^\star - \sigma_{j+1}^\star$ is the $j$-th singular gap of $\mathbf{Y}$.

Theorem 1 in the paper (informal). In the perturbation regime where the cumulative error stays below half the smallest gap, the residual training error is the sum of the truncation tail plus a product-amplified sum of per-step errors. Small gaps of $\mathbf{Y}$ inflate the product; large gaps damp it.

The cascade is geometric in the worst case. If every gap is barely positive, a small error in step 1 can grow by a factor of, say, 8 by step 8 — and that is on top of fresh errors introduced at every later step.

§03Parameter recovery

Theorem 1 is about training error — how well does the model fit the data we trained on? A separate, harder question is parameter recovery: how close is the learned $\widehat{\mathbf{W}}$ to the true $\mathbf{W}^\star$?

Theorem 2 in the paper — noiseless recovery

Under positive singular gaps of $\mathbf{Y}$ and a positive minimum singular value of $\mathbf{X}$, the recovery error inherits the cascade structure. The condition number $\kappa(\mathbf{X})$ replaces the data scale — well-conditioned design helps; ill-conditioned design hurts.

Theorem 3 in the paper — Gaussian label noise

With i.i.d. $\mathcal{N}(0,\varepsilon^2)$ label noise and a small-noise ordering condition, the recovery error decomposes into a (cascading) signal term plus an additive noise term that scales as $\varepsilon \sqrt{n \log(1/\gamma)}$. A standard bias-variance trade-off in the truncation rank $r$ appears.

§04Compute allocation: more-first

The cascade structure has an immediate practical consequence. Errors at early steps get amplified through every later step; errors at late steps are not. So if you have a fixed total compute budget $T$ for solving the $r$ per-step rank-1 sub-problems, you should spend more on the early steps.

The paper formalizes this with a one-parameter family of schedules indexed by a trend parameter $\alpha$. Setting $x_k = 1 - 2(k-1)/(r-1)$, the per-step compute is

$$t_k(\alpha) \;\propto\; \max\big(1 + \alpha x_k,\,10^{-9}\big), \qquad \sum_{k=1}^{r} t_k = T.$$

Positive $\alpha$ shifts the budget toward earlier components; negative $\alpha$ toward later. Three regimes:

Schedule	$\alpha$	Effect
less-first	−1	Early steps starved; cascade amplifies errors. Worst empirically.
equal	0	Naïve uniform allocation. Baseline.
more-first	+1 to +1.5	Saturates around $\alpha \approx 1.5$. Best empirically.

§05Empirics — synthetic linear setting

The synthetic experiments instantiate the family at $(r, T, m) = (20, 500, 2)$. Three named schedules are compared head-to-head, then $\alpha$ is swept on a fine grid.

Three-panel synthetic figure: reconstruction error, training objective error, and cumulative numerical-error proxy for less-first / equal / more-first schedules under fixed total budget T=500. — **FIGURE 1.** Three schedules under fixed total budget $T=500$, rank $r=20$. *Left:* reconstruction error $\lVert\hat{\mathbf W} - \mathbf W^\star\rVert$. *Middle:* training objective $\lVert\mathbf Y - \hat{\mathbf W}\mathbf X\rVert_F$. *Right:* cumulative numerical-error proxy $\sum_{j\le k}\lVert\boldsymbol{\Psi}_j\rVert_F$, where $k$ is the active component at cumulative iteration $t$. Open-circle markers are component-switch points. Vertical dashed lines mark $t_1$, the budget for the first component.

The three panels say the same thing three ways. Whenever the first component gets more budget — left panel: $t_1$ is far to the right; right panel: the green curve rises slowly — the final error is smaller. The right panel makes the cascade visible: less-first accumulates numerical error at a much higher rate, exactly the failure mode Theorem 1 predicts.

Continuous α-sweep: Δ reconstruction error vs α. — **FIGURE 2 (LEFT).** $\Delta$ final reconstruction error vs schedule trend $\alpha$. Negative $\Delta$ = better than the equal schedule.

Continuous α-sweep: Δ cumulative numerical error vs α. — **FIGURE 2 (RIGHT).** $\Delta$ final cumulative numerical-error proxy vs $\alpha$. Same setup, same trend. Color bar: first-component budget $T_1(\alpha)$.

The fine-grained sweep is asymmetric. Moving from $\alpha = 0$ into the more-first half ($\alpha > 0$) buys substantial gains until roughly $\alpha \approx 1.5$, where the curves flatten. Pushing further yields diminishing returns. Moving in the other direction ($\alpha < 0$) sharply degrades both quantities. How much you front-load matters; just that you front-load is the qualitative pattern.

§06Does the pattern survive outside linear regression?

The paper's theorems are stated for fixed-design linear regression. To probe how far the qualitative scheduling pattern travels, the authors also study sequential rank-1 LoRA on simple vision tasks (MNIST, CIFAR10, CIFAR100) and a language task (DistilBERT on SST-2). The vision experiments are in the paper; here we show the SST-2 results because they are the cleanest cross-domain transfer.

SST-2 bar plot: sequential rank-1, rank-2, rank-3 vs jointly trained rank-3 LoRA accuracy. — **FIGURE 6.** SST-2 final accuracy as sequential rank-1 components are added. The first sequential component (0.872) already matches the jointly trained rank-3 LoRA reference; the second and third add 0.003 and 0.001.

The bar chart says something almost embarrassing: on this task, the first rank-1 update via the sequential procedure already matches the jointly trained rank-3 LoRA. The second and third components push accuracy up by less than half a percentage point each. Whatever capacity sequential rank-1 adaptation is offering, it is concentrated in the first component — exactly as the cascade analysis would predict, in qualitative terms.

SST-2 accuracy vs adaptation FLOPs, swept over more-first α schedules. — **FIGURE 7.** SST-2 accuracy vs. adaptation-training FLOPs. Orange trajectories: more-first schedules at various $\alpha \in \{0.25, 0.5, \ldots, 2.0\}$. Black: jointly trained rank-3 LoRA reference. Green: equal-schedule sequential baseline.

Two patterns. First, most more-first schedules sit above the equal-schedule baseline (green) at intermediate compute — the schedule choice matters in addition to the total budget. Second, within the more-first branch the dependence on $\alpha$ is not monotone — the orange curves cluster closely at higher compute, so once early components already get a substantial share, further front-loading produces diminishing differences. The same pattern as the synthetic experiment, in a domain the theory does not formally cover.

Caveat ▸ These LoRA experiments are exploratory probes, not theorem-validated extensions. The theorems hold for fixed-design linear regression. The qualitative pattern transfers; the proof does not. Treat the deep-learning results as evidence that the linear theory's intuition survives in adjacent settings — not as a guarantee, and not as a state-of-the-art benchmark claim.

§07What this paper does NOT claim

The abstract is unusually explicit about this, and it is worth restating. The paper is an analysis framework, not a sales pitch for a new method.

Not a generalization theory — Theorems 2/3 are fixed-design parameter recovery, not test-error bounds.
Not superiority over joint or one-shot training — the experiments show comparable accuracy, not improvement.
Not a tight bound — Theorem 1 is an upper bound, not matched lower.
Not a theorem-grade extension to nonlinear PEFT — the LoRA experiments are outside the proof's regime.
Not an optimal-schedule prescription — only that front-loading helps and back-loading hurts, in the linear setting.
Not applicable when the true matrix has repeated singular values (gap = 0). That requires invariant-subspace tracking, left to future work.

§08Why anyone would care

Three reasons.

Method-agnostic insight. The cascade analysis covers any algorithm that builds up a model one low-rank piece at a time. That includes the LoRA family, OMP-style greedy methods, deflation-based PCA, and several incremental compression schemes. One theorem informs the design of all of them.

Resource allocation guidance. The more-first schedule is a concrete, implementable recommendation: when training in stages on a fixed compute budget, spend disproportionately more on the first few stages. It costs nothing to apply and the experiments show consistent improvement across linear regression and two cross-domain LoRA probes.

Honest scoping. The paper does not claim a new state-of-the-art method or beat LoRA on benchmarks. It explains why sequential methods behave the way they do. In a field where most papers oversell, an explanatory contribution that cleanly delineates what is and is not proven is increasingly valuable as a foundation other work can build on.

Where this fits. This work is the linear half of a longer programme. The companion piece — AdaPaD: From PCA to LoRA — asks what happens if you do the deflation in parallel instead of sequentially: the workers don't compound errors; their bounds self-correct across rounds. Reading the two together makes the trade-off explicit: sequential rank-1 is cheap and analytically clean, but it cascades; parallel rank-1 is communication-bound but self-corrects. The follow-up on nonlinear sequential rank-1 (SG-LoRA) is in progress.

§09Open questions

Open ▸ Tight lower bounds matching the cascade product factor — is the worst case exponential in $r$ when gaps are small, or is the bound loose?

Open ▸ Extension to data-dependent (random) design and out-of-sample generalization bounds. The fixed-design assumption is the proof's main scope limit.

Open ▸ Treatment of repeated singular values ($\mathcal{T}_k^\star = 0$) via invariant-subspace tracking (Davis–Kahan style). The current proof requires strict gaps.

Open ▸ Theorem-grade analysis of nonlinear PEFT (e.g., the SG-LoRA / sequential LoRA family). The probes here show the pattern transfers; proving it does not.

§10Citation

@article{vandchali2026onerank,
  title   = {One Rank at a Time: Cascading Error Dynamics in Sequential Learning},
  author  = {Vandchali, Mahtab Alizadeh and Liao, Fangshuo and Kyrillidis, Anastasios},
  journal = {Transactions on Machine Learning Research (TMLR)},
  year    = {2026},
  note    = {Accepted, in press. arXiv:2505.22602}
}

One Rank at a Time:Cascading Error Dynamics in Sequential Learning

§01Setup

§02The cascade

§03Parameter recovery

§04Compute allocation: more-first

§05Empirics — synthetic linear setting

§06Does the pattern survive outside linear regression?

§07What this paper does NOT claim

§08Why anyone would care

§09Open questions

§10Citation

One Rank at a Time:
Cascading Error Dynamics in Sequential Learning