Rank-1 as a Building Block for Million-Node Max-3-Cut

Incremental scoring, 2-eigenvector rounding, and hybrid warm-starts at extreme scale
Ria Stevens, Fangshuo Liao, Barbara Su, Jianqiang Li, Anastasios Kyrillidis
Department of Computer Science, Rice University
April 2026

1. From GPUs to a Single CPU

In our previous post, we threw 15 GPUs at Max-3-Cut and pushed the rank-2 algorithm to \(n = 3{,}600\) nodes. The rank-2 approach is powerful — it enumerates \(O(n^3)\) candidate solutions and provably contains the global optimizer when the objective has rank 2 — but its cubic scaling hits a wall around a few thousand nodes, even with parallelism.

What if we drop one rank? The rank-1 algorithm reduces the candidate set to \(O(n)\) and needs no GPU at all. The trade-off seems clear: less computation, worse quality. But three algorithmic insights change this picture dramatically:

  1. Incremental scoring — exploit sparsity to evaluate candidates in \(O(n \cdot d)\) total instead of \(O(n^2)\)
  2. 2-eigenvector complex rounding — fix a degeneracy that was losing one-third of the solution space
  3. Hybrid warm-start — use the rank-1 solution to seed greedy local search

Together, these let us solve million-node graphs in minutes on a laptop-class CPU — and the hybrid solver beats greedy local search on every single instance we tested.

Rank-1 as a standalone solver and as an initializer

A key theme of this work is that rank-1 serves a dual role. As a standalone solver, the rank-1 phase sweep produces a complete partition by rounding eigenvector phases to roots of unity. On graphs with strong spectral structure — torus, lattice, and other highly regular families — this solution is already optimal or near-optimal. On torus graphs, rank-1 beats every other method we tested, including simulated annealing.

But rank-1 also serves as an initializer for other algorithms. Local search methods like greedy hill-climbing are fast but trapped by their starting point: from a random initialization, they converge to whatever local optimum is nearest. The rank-1 solution provides a spectrally informed starting point that encodes global structure of the graph — information that pure local search is blind to. We call this combination Hybrid: run rank-1 first, then run greedy local search starting from the rank-1 partition instead of a random one.

The distinction matters because the hybrid solver is not simply “rank-1 but better.” It is a different algorithm that composes spectral global information (rank-1) with local optimization (greedy). The two components have complementary strengths: rank-1 captures global structure but ignores local details; greedy refines local details but is blind to global structure. Together, they consistently outperform either method alone.
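
The composition is small enough to sketch end to end. Here is a minimal self-contained version, assuming a scipy sparse Laplacian `Q`; the function names (`rank1_init`, `greedy_refine`, `hybrid`) are illustrative, not the repository's API, and the greedy step rescoring from scratch is deliberately naive:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

ROOTS = np.exp(2j * np.pi * np.arange(3) / 3)  # cube roots of unity {1, w, w^2}

def score(Q, labels):
    """Max-3-Cut objective z^dagger Q z for labels in {0, 1, 2}."""
    z = ROOTS[labels]
    return float(np.real(np.conj(z) @ (Q @ z)))

def rank1_init(Q):
    """Spectral initialization: round phases of v1 + j*v2 to thirds of the circle."""
    vals, vecs = eigsh(Q, k=2, which="LA")
    order = np.argsort(vals)[::-1]
    q = vecs[:, order[0]] + 1j * vecs[:, order[1]]
    sector = np.floor(np.mod(np.angle(q), 2 * np.pi) / (2 * np.pi / 3))
    return sector.astype(int) % 3

def greedy_refine(Q, labels, max_passes=20):
    """One-flip hill climbing; full rescoring per move is fine for a sketch."""
    labels = labels.copy()
    current = score(Q, labels)
    for _ in range(max_passes):
        improved = False
        for i in range(len(labels)):
            old = labels[i]
            for c in range(3):
                if c == old:
                    continue
                labels[i] = c
                s = score(Q, labels)
                if s > current + 1e-12:
                    current, improved = s, True
                    old = c
                else:
                    labels[i] = old  # revert to best label so far
        if not improved:
            break
    return labels

def hybrid(Q):
    """Hybrid = rank-1 warm start, then greedy local search."""
    return greedy_refine(Q, rank1_init(Q))
```

The point of the sketch is the last three lines: hybrid is just greedy seeded with the rank-1 partition instead of a random one.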

At a glance: 1M nodes solved · 45/45 instances where hybrid beats greedy · 0 GPUs required · ~10 lines of new code.

2. The \(O(n^2)\) Bottleneck

Recall the setup. Given a graph Laplacian \(\mathbf{Q}\) and its top eigenvector \(\mathbf{v}\), the rank-1 phase sweep constructs an initial partition \(\mathbf{z}\) from \(\mathbf{v}\)'s phases, then sweeps through \(n\) boundary points, flipping one node at a time. At each step, we need the new score \(\mathbf{z}^\dagger \mathbf{Q} \mathbf{z}\).

The naive approach recomputes the score from scratch: multiply \(\mathbf{Q}\mathbf{z}\), then dot with \(\mathbf{z}\). Each evaluation costs \(O(\text{nnz})\) for a sparse matrix with nnz non-zeros. Over the \(n\) steps of the sweep, the total is \(O(n \cdot \text{nnz})\). For a 5-regular graph (where nnz \(\approx 5n\)), this gives \(O(n^2)\) — manageable at \(n = 1{,}000\), but 17 hours at \(n = 100{,}000\).

The key observation

When one node \(i\) flips from \(z_\text{old}\) to \(z_\text{new}\), the score change has a closed-form expression:

$$\Delta = 2 \cdot \text{Re}\big(\overline{\delta z} \cdot (\mathbf{Q}\mathbf{z})_i\big) + |\delta z|^2 \cdot Q_{ii}$$

where \(\delta z = z_\text{new} - z_\text{old}\). This is \(O(1)\) per candidate. The catch is maintaining the cached product \(\mathbf{Q}\mathbf{z}\), which requires updating the rows affected by the flip: \((\mathbf{Q}\mathbf{z})_j \mathrel{+}= Q_{ji} \cdot \delta z\) for each neighbor \(j\) of node \(i\).

For a graph with average degree \(d\), each update touches \(d\) entries. Over \(n\) flips, the total cost is \(O(n \cdot d)\) instead of \(O(n^2)\). On a 5-regular graph, that is an \(n/5\) speedup — a factor of 20,000 at \(n = 100{,}000\).
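
A minimal sketch of the incremental update, assuming `Q_csc` is a real symmetric scipy CSC matrix and `z` a complex partition vector; `flip` and `naive_score` are illustrative names. The closed-form delta can be checked against a from-scratch recompute:

```python
import numpy as np

def naive_score(Q, z):
    """From-scratch score z^dagger Q z: O(nnz) per evaluation."""
    return np.real(np.conj(z) @ (Q @ z))

def flip(Q_csc, z, Qz, i, z_new):
    """Set z[i] = z_new in place; return the score change in O(degree) work."""
    dz = z_new - z[i]
    # closed-form delta, using the cached product (Qz)_i at the old z
    delta = 2.0 * np.real(np.conj(dz) * Qz[i]) + abs(dz) ** 2 * Q_csc[i, i]
    # maintain the cache via direct CSC column access (no dense allocation)
    start, end = Q_csc.indptr[i], Q_csc.indptr[i + 1]
    Qz[Q_csc.indices[start:end]] += Q_csc.data[start:end] * dz
    z[i] = z_new
    return delta
```

Because \(\mathbf{Q}\) is symmetric, updating column \(i\) of the cached product is equivalent to updating the affected rows.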

The sparse update in code

The critical implementation detail: directly index into the CSC sparse matrix structure. A seemingly equivalent approach — Q.getcol(i).toarray() — allocates an \(n\)-dimensional vector on every call, turning \(O(d)\) into \(O(n)\).

# The wrong way: O(n) per update due to dense allocation
col = Q_csc.getcol(i).toarray().flatten()
Qz += col * dz

# The right way: O(degree) per update via direct CSC access
start = Q_csc.indptr[i]
end = Q_csc.indptr[i + 1]
rows = Q_csc.indices[start:end]
vals = Q_csc.data[start:end]
Qz[rows] += vals * dz

3. Fixing the Rounding: Two Eigenvectors for Three Partitions

For Max-3-Cut, we need to assign each node to one of three partitions, represented by the cube roots of unity: \(\{1, \omega, \omega^2\}\) where \(\omega = e^{2\pi i/3}\). The rank-1 algorithm maps eigenvector entries to partitions based on their complex phase angles.

The problem: graph Laplacians are real symmetric, so their eigenvectors are real-valued. A real number has a complex phase of either 0 or \(\pi\), mapping to at most 2 of the 3 partitions. One-third of the solution space is unreachable.

2-eigenvector complex rounding

Instead of using a single real eigenvector, combine the top two:

$$q_i = v_1(i) + j \cdot v_2(i)$$

Now \(\text{arg}(q_i)\) spans the full \([0, 2\pi)\) circle, and all three partitions are reachable. The cost is one extra eigenvalue computation — negligible compared to the sweep.
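
A small self-contained illustration of the degeneracy and the fix, using toy vectors in place of real eigenvectors:

```python
import numpy as np

def phase_to_label(q):
    """Map complex phases to {0, 1, 2} via thirds of the unit circle."""
    sector = np.floor(np.mod(np.angle(q), 2 * np.pi) / (2 * np.pi / 3))
    return sector.astype(int) % 3

v1 = np.array([0.7, -0.3, 0.1, -0.5])   # toy "top eigenvector" entries
v2 = np.array([0.1, 0.6, -0.7, 0.2])    # toy "second eigenvector" entries

# Real entries have phase 0 or pi, so at most 2 of the 3 sectors are hit.
real_labels = phase_to_label(v1.astype(complex))

# The complex combination spans the full circle, so all 3 sectors are reachable.
complex_labels = phase_to_label(v1 + 1j * v2)
```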

The effect on torus graphs is striking: with the fix, rank-1 finds the best solution among all methods, including simulated annealing. Without it, rank-1 is limited to a bipartition of a 3-colorable structure.


4. Scaling to a Million Nodes

With incremental scoring, the sweep phase scales linearly in \(n \cdot d\). The bottleneck shifts to the eigenvalue computation (ARPACK via scipy.sparse.linalg.eigsh), which scales roughly as \(O(\text{nnz} \cdot k_\text{iter})\).

Rank-1 timing breakdown on 5-regular graphs
Sweep time grows linearly with n. Eigensolve dominates at large n due to ARPACK iteration count.

5. Results: 45 Instances from 10K to 1M Nodes

We tested five methods on 45 graph instances across five families: 5-regular, toroidal grid, Erdős–Rényi, Delaunay triangulations, and road networks.

Methods

We compare five solvers: random assignment, Greedy local search, SA (simulated annealing), Rank-1 (the phase sweep with 2-eigenvector rounding), and Hybrid (rank-1 followed by greedy refinement).

Graphs

Synthetic: We test on three standard graph families at sizes \(n = 10{,}000\) to \(n = 1{,}000{,}000\), each with 3 random seeds: 5-regular (every node has exactly 5 neighbors), toroidal grid (periodic 2D lattice), and Erdős–Rényi \(G(n, 10/n)\).

Real-world: We include two families from standard benchmark collections: Delaunay triangulations (1K–524K nodes) and road networks (1M+ nodes).

Score ratios relative to Greedy (higher is better), with the best and second-best method highlighted per instance.

The headline: Hybrid wins every time

Across all 45 instances, the hybrid solver (rank-1 + greedy) produces a better cut than greedy alone. The margin ranges from +0.3% on road networks to +3.9% on torus graphs. This is not a fluke — the spectral warm-start consistently guides greedy to a better local optimum.

Score comparison across scales (averaged over 3 seeds): Hybrid consistently outperforms Greedy.

Torus: Rank-1 is king

On toroidal grids, rank-1 beats every other method at every scale from 10K to 1M nodes. The spectral structure of the torus is perfectly captured by two eigenvectors. Greedy, which is blind to global structure, finds a solution 3.9% worse. Even SA — with its ability to escape local optima — falls 0.5–2.1% short.

Regular: Hybrid overtakes SA at scale

On 5-regular graphs, SA is the best method at \(n \leq 100{,}000\), beating hybrid by about 1%. But at \(n \geq 500{,}000\), the tables turn: SA's fixed iteration budget becomes insufficient, while hybrid's warm-start advantage holds. At \(n = 1{,}000{,}000\), hybrid beats SA by 1.1%.

Real-world: Hybrid still helps

On Delaunay meshes (1K–524K nodes) and road networks (1M+ nodes), rank-1 alone is weak — only 66–80% of greedy. These graphs lack the clean spectral structure of torus or regular graphs. But the hybrid warm-start still adds 0.3–1.1% over greedy, demonstrating that even a mediocre spectral starting point beats a random one.


6. Honest Assessment

Where spectral methods shine

Graphs with strong spectral structure (torus, lattice, and other highly regular families), where one or two eigenvectors already encode a near-optimal partition and rank-1 alone matches or beats every metaheuristic we tried.

Where spectral methods struggle

Spectrally diffuse graphs such as Delaunay meshes and road networks, where rank-1 alone reaches only 66–80% of greedy's score and earns its keep mainly as a warm-start.

Time vs. score quality trade-off: each dot is one instance, plotting wall-clock time (log scale) against score as a ratio to the best known.

7. What This Enables

Rank-1 is a building block, not the final product. Three directions build on it:

7.1 Randomized rank-\(r\)

The rank-2 algorithm enumerates \(O(n^3)\) candidates, which is intractable for large \(n\). But preliminary experiments show that random sampling of just 0.1% of rank-2 candidates recovers 99.8% of the exact optimum. If this holds at larger \(n\), it would make rank-2 practical at \(n = 10{,}000+\) without any GPUs.
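
A sketch of what such a sampler could look like. The rank-2 candidate construction is not reproduced here, so `candidate_fn` below is a hypothetical stand-in for it, as is `sample_best` itself:

```python
import numpy as np

def sample_best(score_fn, candidate_fn, n, frac=0.001, seed=None):
    """Evaluate a random fraction of the ~n^3/6 index triples, keep the best."""
    rng = np.random.default_rng(seed)
    total = n * (n - 1) * (n - 2) // 6   # loose count of candidate triples
    m = max(1, int(frac * total))
    best, best_score = None, -np.inf
    for _ in range(m):
        i, j, k = rng.choice(n, size=3, replace=False)
        cand = candidate_fn(i, j, k)     # build the rank-2 candidate for (i, j, k)
        s = score_fn(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```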

7.2 Rank-1 guided search for rank-2

Can the rank-1 solution help make rank-2 more efficient? The connection is not obvious — the two algorithms operate in different spaces — but there are plausible mechanisms, such as using the rank-1 partition to prioritize or prune the rank-2 candidate set.

These ideas remain to be validated experimentally. Whether a rigorous theoretical connection exists — e.g., a bound on how many rank-2 candidates fall “near” the rank-1 optimum — is an open question.

7.3 Spectral suitability prediction

Our results show a clear pattern: rank-1 excels on graphs with strong spectral structure (large eigengap, low effective rank) and struggles on spectrally diffuse graphs. The eigengap \(\lambda_1 - \lambda_2\) and the ratio \(\lambda_1 / \text{tr}(\mathbf{Q})\) may predict when spectral methods will outperform metaheuristics, enabling algorithm selection before solving.
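
Both quantities are cheap to compute before committing to a solver. A minimal sketch (`spectral_stats` is an illustrative helper, not part of the repository, and any decision thresholds would still need calibration):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_stats(Q):
    """Return (eigengap, leading-eigenvalue share of the trace) for sparse symmetric Q."""
    vals = eigsh(Q, k=2, which="LA", return_eigenvectors=False)
    lam1, lam2 = np.sort(vals)[::-1]
    return lam1 - lam2, lam1 / Q.diagonal().sum()
```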


8. Code & Reproducibility

All code and experiment scripts are available on GitHub:

github.com/barbara-su/MaxKCutParallel

Quick start

# Rank-1 + Hybrid (single CPU, any graph size)
python src/hybrid.py --q_path data/Q.npy --v_path data/V.npy --K 3

# Generate a 5-regular graph and solve
python src/graph_generators/gen_regular_random.py --n 100000 --degree 5 --rank 2 --seed 42
python src/hybrid.py --q_path instances/regular/Q_regular_100000_seed_42.npy \
                     --v_path instances/regular/V_regular_100000_seed_42.npy --K 3

# Compare all baselines
python src/baselines.py --q_path data/Q.npy --K 3 --methods random,greedy,sa

References

  1. A. Frieze and M. Jerrum. “Improved Approximation Algorithms for MAX k-CUT and MAX BISECTION.” Algorithmica, 18(1):67–81, 1997.
  2. S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. “Optimization by Simulated Annealing.” Science, 220(4598):671–680, 1983.
  3. T. A. Davis and Y. Hu. “The University of Florida Sparse Matrix Collection.” ACM Transactions on Mathematical Software, 38(1):1–25, 2011.
  4. J. Leskovec and A. Krevl. “SNAP Datasets: Stanford Large Network Dataset Collection.” snap.stanford.edu/data, 2014.

Citation

@article{Stevens2025maxkcut,
  title   = {Exploiting Low-Rank Structure in Max-$K$-Cut Problems},
  author  = {Stevens, Ria and Liao, Fangshuo and Su, Barbara and Li, Jianqiang and Kyrillidis, Anastasios},
  journal = {arXiv preprint arXiv:2602.20376},
  year    = {2025}
}