Rank-1 as a Building Block for Million-Node Max-3-Cut

Incremental scoring, 2-eigenvector rounding, and hybrid warm-starts at extreme scale
Ria Stevens, Fangshuo Liao, Barbara Su, Jianqiang Li, Anastasios Kyrillidis
Department of Computer Science, Rice University
April 2026

1. From GPUs to a Single CPU

In our previous post, we threw 15 GPUs at Max-3-Cut and pushed the rank-2 algorithm to \(n = 3{,}600\) nodes. The rank-2 approach is powerful — it enumerates \(O(n^3)\) candidate solutions and provably contains the global optimizer when the objective has rank 2 — but its cubic scaling hits a wall around a few thousand nodes, even with parallelism.

What if we drop one rank? The rank-1 algorithm reduces the candidate set to \(O(n)\) and needs no GPU at all. The trade-off seems clear: less computation, worse quality. But three algorithmic insights change this picture dramatically:

  1. Incremental scoring — exploit sparsity to evaluate candidates in \(O(n \cdot d)\) total instead of \(O(n^2)\)
  2. 2-eigenvector complex rounding — fix a degeneracy that was losing one-third of the solution space
  3. Hybrid warm-start — use the rank-1 solution to seed greedy local search

Together, these let us solve million-node graphs in minutes on a laptop-class CPU — and the hybrid solver beats greedy local search on every single instance we tested.

Rank-1 as a standalone solver and as an initializer

A key theme of this work is that rank-1 serves a dual role. As a standalone solver, the rank-1 phase sweep produces a complete partition by rounding eigenvector phases to roots of unity. On graphs with strong spectral structure — torus, lattice, and other highly regular families — this solution is already optimal or near-optimal. On torus graphs, rank-1 beats every other method we tested, including simulated annealing.

But rank-1 also serves as an initializer for other algorithms. Local search methods like greedy hill-climbing are fast but trapped by their starting point: from a random initialization, they converge to whatever local optimum is nearest. The rank-1 solution provides a spectrally informed starting point that encodes global structure of the graph — information that pure local search is blind to. We call this combination Hybrid: run rank-1 first, then run greedy local search starting from the rank-1 partition instead of a random one.

The distinction matters because the hybrid solver is not simply “rank-1 but better.” It is a different algorithm that composes spectral global information (rank-1) with local optimization (greedy). The two components have complementary strengths: rank-1 captures global structure but ignores local details; greedy refines local details but is blind to global structure. Together, they consistently outperform either method alone.
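
The composition is small enough to sketch end to end. Here is a minimal self-contained version, assuming a scipy sparse Laplacian `Q`; the function names (`rank1_init`, `greedy_refine`, `hybrid`) are illustrative, not the repository's API, and the greedy step rescoring from scratch is deliberately naive:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

ROOTS = np.exp(2j * np.pi * np.arange(3) / 3)  # cube roots of unity {1, w, w^2}

def score(Q, labels):
    """Max-3-Cut objective z^dagger Q z for labels in {0, 1, 2}."""
    z = ROOTS[labels]
    return float(np.real(np.conj(z) @ (Q @ z)))

def rank1_init(Q):
    """Spectral initialization: round phases of v1 + j*v2 to thirds of the circle."""
    vals, vecs = eigsh(Q, k=2, which="LA")
    order = np.argsort(vals)[::-1]
    q = vecs[:, order[0]] + 1j * vecs[:, order[1]]
    sector = np.floor(np.mod(np.angle(q), 2 * np.pi) / (2 * np.pi / 3))
    return sector.astype(int) % 3

def greedy_refine(Q, labels, max_passes=20):
    """One-flip hill climbing; full rescoring per move is fine for a sketch."""
    labels = labels.copy()
    current = score(Q, labels)
    for _ in range(max_passes):
        improved = False
        for i in range(len(labels)):
            old = labels[i]
            for c in range(3):
                if c == old:
                    continue
                labels[i] = c
                s = score(Q, labels)
                if s > current + 1e-12:
                    current, improved = s, True
                    old = c
                else:
                    labels[i] = old  # revert to best label so far
        if not improved:
            break
    return labels

def hybrid(Q):
    """Hybrid = rank-1 warm start, then greedy local search."""
    return greedy_refine(Q, rank1_init(Q))
```

The point of the sketch is the last three lines: hybrid is just greedy seeded with the rank-1 partition instead of a random one.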

At a glance: 1M nodes solved · 45/45 instances where hybrid beats greedy · 0 GPUs required · ~10 lines of new code.

2. The \(O(n^2)\) Bottleneck

Recall the setup. Given a graph Laplacian \(\mathbf{Q}\) and its top eigenvector \(\mathbf{v}\), the rank-1 phase sweep constructs an initial partition \(\mathbf{z}\) from \(\mathbf{v}\)'s phases, then sweeps through \(n\) boundary points, flipping one node at a time. At each step, we need the new score \(\mathbf{z}^\dagger \mathbf{Q} \mathbf{z}\).

The naive approach recomputes the score from scratch: multiply \(\mathbf{Q}\mathbf{z}\), then dot with \(\mathbf{z}\). Each evaluation costs \(O(\text{nnz})\) for a sparse matrix with nnz non-zeros. Over the \(n\) steps of the sweep, the total is \(O(n \cdot \text{nnz})\). For a 5-regular graph (where nnz \(\approx 5n\)), this gives \(O(n^2)\) — manageable at \(n = 1{,}000\), but 17 hours at \(n = 100{,}000\).

The key observation

When one node \(i\) flips from \(z_\text{old}\) to \(z_\text{new}\), the score change has a closed-form expression:

$$\Delta = 2 \cdot \text{Re}\big(\overline{\delta z} \cdot (\mathbf{Q}\mathbf{z})_i\big) + |\delta z|^2 \cdot Q_{ii}$$

where \(\delta z = z_\text{new} - z_\text{old}\). This is \(O(1)\) per candidate. The catch is maintaining the cached product \(\mathbf{Q}\mathbf{z}\), which requires updating the rows affected by the flip: \((\mathbf{Q}\mathbf{z})_j \mathrel{+}= Q_{ji} \cdot \delta z\) for each neighbor \(j\) of node \(i\).

For a graph with average degree \(d\), each update touches \(d\) entries. Over \(n\) flips, the total cost is \(O(n \cdot d)\) instead of \(O(n^2)\). On a 5-regular graph, that is an \(n/5\) speedup — a factor of 20,000 at \(n = 100{,}000\).
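
A minimal sketch of the incremental update, assuming `Q_csc` is a real symmetric scipy CSC matrix and `z` a complex partition vector; `flip` and `naive_score` are illustrative names. The closed-form delta can be checked against a from-scratch recompute:

```python
import numpy as np

def naive_score(Q, z):
    """From-scratch score z^dagger Q z: O(nnz) per evaluation."""
    return np.real(np.conj(z) @ (Q @ z))

def flip(Q_csc, z, Qz, i, z_new):
    """Set z[i] = z_new in place; return the score change in O(degree) work."""
    dz = z_new - z[i]
    # closed-form delta, using the cached product (Qz)_i at the old z
    delta = 2.0 * np.real(np.conj(dz) * Qz[i]) + abs(dz) ** 2 * Q_csc[i, i]
    # maintain the cache via direct CSC column access (no dense allocation)
    start, end = Q_csc.indptr[i], Q_csc.indptr[i + 1]
    Qz[Q_csc.indices[start:end]] += Q_csc.data[start:end] * dz
    z[i] = z_new
    return delta
```

Because \(\mathbf{Q}\) is symmetric, updating column \(i\) of the cached product is equivalent to updating the affected rows.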

The sparse update in code

The critical implementation detail: directly index into the CSC sparse matrix structure. A seemingly equivalent approach — Q.getcol(i).toarray() — allocates an \(n\)-dimensional vector on every call, turning \(O(d)\) into \(O(n)\).

# The wrong way: O(n) per update due to dense allocation
col = Q_csc.getcol(i).toarray().flatten()
Qz += col * dz

# The right way: O(degree) per update via direct CSC access
start = Q_csc.indptr[i]
end = Q_csc.indptr[i + 1]
rows = Q_csc.indices[start:end]
vals = Q_csc.data[start:end]
Qz[rows] += vals * dz

3. Fixing the Rounding: Two Eigenvectors for Three Partitions

For Max-3-Cut, we need to assign each node to one of three partitions, represented by the cube roots of unity: \(\{1, \omega, \omega^2\}\) where \(\omega = e^{2\pi i/3}\). The rank-1 algorithm maps eigenvector entries to partitions based on their complex phase angles.

The problem: graph Laplacians are real symmetric, so their eigenvectors are real-valued. A real number has a complex phase of either 0 or \(\pi\), mapping to at most 2 of the 3 partitions. One-third of the solution space is unreachable.

2-eigenvector complex rounding

Instead of using a single real eigenvector, combine the top two:

$$q_i = v_1(i) + j \cdot v_2(i)$$

Now \(\text{arg}(q_i)\) spans the full \([0, 2\pi)\) circle, and all three partitions are reachable. The cost is one extra eigenvalue computation — negligible compared to the sweep.
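
A small self-contained illustration of the degeneracy and the fix, using toy vectors in place of real eigenvectors:

```python
import numpy as np

def phase_to_label(q):
    """Map complex phases to {0, 1, 2} via thirds of the unit circle."""
    sector = np.floor(np.mod(np.angle(q), 2 * np.pi) / (2 * np.pi / 3))
    return sector.astype(int) % 3

v1 = np.array([0.7, -0.3, 0.1, -0.5])   # toy "top eigenvector" entries
v2 = np.array([0.1, 0.6, -0.7, 0.2])    # toy "second eigenvector" entries

# Real entries have phase 0 or pi, so at most 2 of the 3 sectors are hit.
real_labels = phase_to_label(v1.astype(complex))

# The complex combination spans the full circle, so all 3 sectors are reachable.
complex_labels = phase_to_label(v1 + 1j * v2)
```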

The effect on torus graphs is striking: with the fix, rank-1 finds the best solution among all methods, including simulated annealing. Without it, rank-1 is limited to a bipartition of a 3-colorable structure.


4. Scaling to a Million Nodes

With incremental scoring, the sweep phase scales linearly in \(n \cdot d\). The bottleneck shifts to the eigenvalue computation (ARPACK via scipy.sparse.linalg.eigsh), which scales roughly as \(O(\text{nnz} \cdot k_\text{iter})\).

Rank-1 timing breakdown on 5-regular graphs
Sweep time grows linearly with n. Eigensolve dominates at large n due to ARPACK iteration count.

5. Results: 45 Instances from 10K to 1M Nodes

We tested five methods on 45 graph instances across five families: 5-regular, toroidal grid, Erdős–Rényi, Delaunay triangulations, and road networks.

Methods

We compare five solvers: random assignment, Greedy local search, SA (simulated annealing), Rank-1 (the phase sweep with 2-eigenvector rounding), and Hybrid (rank-1 followed by greedy refinement).

Graphs

Synthetic: We test on three standard graph families at sizes \(n = 10{,}000\) to \(n = 1{,}000{,}000\), each with 3 random seeds: 5-regular (every node has exactly 5 neighbors), toroidal grid (periodic 2D lattice), and Erdős–Rényi \(G(n, 10/n)\).

Real-world: We include two families from standard benchmark collections: Delaunay triangulations (1K–524K nodes) and road networks (1M+ nodes).

Score ratios relative to Greedy (higher is better), with the best and second-best method highlighted per instance.

The headline: Hybrid wins every time

Across all 45 instances, the hybrid solver (rank-1 + greedy) produces a better cut than greedy alone. The margin ranges from +0.3% on road networks to +3.9% on torus graphs. This is not a fluke — the spectral warm-start consistently guides greedy to a better local optimum.

Score comparison across scales (averaged over 3 seeds): Hybrid consistently outperforms Greedy.

Torus: Rank-1 is king

On toroidal grids, rank-1 beats every other method at every scale from 10K to 1M nodes. The spectral structure of the torus is perfectly captured by two eigenvectors. Greedy, which is blind to global structure, finds a solution 3.9% worse. Even SA — with its ability to escape local optima — falls 0.5–2.1% short.

Regular: Hybrid overtakes SA at scale

On 5-regular graphs, SA is the best method at \(n \leq 100{,}000\), beating hybrid by about 1%. But at \(n \geq 500{,}000\), the tables turn: SA's fixed iteration budget becomes insufficient, while hybrid's warm-start advantage holds. At \(n = 1{,}000{,}000\), hybrid beats SA by 1.1%.

Real-world: Hybrid still helps

On Delaunay meshes (1K–524K nodes) and road networks (1M+ nodes), rank-1 alone is weak — only 66–80% of greedy. These graphs lack the clean spectral structure of torus or regular graphs. But the hybrid warm-start still adds 0.3–1.1% over greedy, demonstrating that even a mediocre spectral starting point beats a random one.


6. Honest Assessment

Where spectral methods shine

Graphs with strong spectral structure (torus, lattice, and other highly regular families), where one or two eigenvectors already encode a near-optimal partition and rank-1 alone matches or beats every metaheuristic we tried.

Where spectral methods struggle

Spectrally diffuse graphs such as Delaunay meshes and road networks, where rank-1 alone reaches only 66–80% of greedy's score and earns its keep mainly as a warm-start.

Time vs. score quality trade-off: each dot is one instance, plotting wall-clock time (log scale) against score as a ratio to the best known.

7. What This Enables

Rank-1 is a building block, not the final product. Three directions build on it:

7.1 Randomized rank-\(r\)

The rank-2 algorithm enumerates \(O(n^3)\) candidates, which is intractable for large \(n\). But preliminary experiments show that random sampling of just 0.1% of rank-2 candidates recovers 99.8% of the exact optimum. If this holds at larger \(n\), it would make rank-2 practical at \(n = 10{,}000+\) without any GPUs.
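
A sketch of what such a sampler could look like. The rank-2 candidate construction is not reproduced here, so `candidate_fn` below is a hypothetical stand-in for it, as is `sample_best` itself:

```python
import numpy as np

def sample_best(score_fn, candidate_fn, n, frac=0.001, seed=None):
    """Evaluate a random fraction of the ~n^3/6 index triples, keep the best."""
    rng = np.random.default_rng(seed)
    total = n * (n - 1) * (n - 2) // 6   # loose count of candidate triples
    m = max(1, int(frac * total))
    best, best_score = None, -np.inf
    for _ in range(m):
        i, j, k = rng.choice(n, size=3, replace=False)
        cand = candidate_fn(i, j, k)     # build the rank-2 candidate for (i, j, k)
        s = score_fn(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```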

7.2 Rank-1 guided search for rank-2

Can the rank-1 solution help make rank-2 more efficient? The connection is not obvious — the two algorithms operate in different spaces — but there are plausible mechanisms, such as using the rank-1 partition to prioritize or prune the rank-2 candidate set.

These ideas remain to be validated experimentally. Whether a rigorous theoretical connection exists — e.g., a bound on how many rank-2 candidates fall “near” the rank-1 optimum — is an open question.

7.3 Spectral suitability prediction

Our results show a clear pattern: rank-1 excels on graphs with strong spectral structure (large eigengap, low effective rank) and struggles on spectrally diffuse graphs. The eigengap \(\lambda_1 - \lambda_2\) and the ratio \(\lambda_1 / \text{tr}(\mathbf{Q})\) may predict when spectral methods will outperform metaheuristics, enabling algorithm selection before solving.
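
Both quantities are cheap to compute before committing to a solver. A minimal sketch (`spectral_stats` is an illustrative helper, not part of the repository, and any decision thresholds would still need calibration):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_stats(Q):
    """Return (eigengap, leading-eigenvalue share of the trace) for sparse symmetric Q."""
    vals = eigsh(Q, k=2, which="LA", return_eigenvectors=False)
    lam1, lam2 = np.sort(vals)[::-1]
    return lam1 - lam2, lam1 / Q.diagonal().sum()
```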


8. Code & Reproducibility

All code and experiment scripts are available on GitHub:

github.com/barbara-su/MaxKCutParallel

Quick start

# Rank-1 + Hybrid (single CPU, any graph size)
python src/hybrid.py --q_path data/Q.npy --v_path data/V.npy --K 3

# Generate a 5-regular graph and solve
python src/graph_generators/gen_regular_random.py --n 100000 --degree 5 --rank 2 --seed 42
python src/hybrid.py --q_path instances/regular/Q_regular_100000_seed_42.npy \
                     --v_path instances/regular/V_regular_100000_seed_42.npy --K 3

# Compare all baselines
python src/baselines.py --q_path data/Q.npy --K 3 --methods random,greedy,sa

References

  1. A. Frieze and M. Jerrum. “Improved Approximation Algorithms for MAX k-CUT and MAX BISECTION.” Algorithmica, 18(1):67–81, 1997.
  2. S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. “Optimization by Simulated Annealing.” Science, 220(4598):671–680, 1983.
  3. T. A. Davis and Y. Hu. “The University of Florida Sparse Matrix Collection.” ACM Transactions on Mathematical Software, 38(1):1–25, 2011.
  4. J. Leskovec and A. Krevl. “SNAP Datasets: Stanford Large Network Dataset Collection.” snap.stanford.edu/data, 2014.

Citation

@article{Stevens2025maxkcut,
  title   = {Exploiting Low-Rank Structure in Max-$K$-Cut Problems},
  author  = {Stevens, Ria and Liao, Fangshuo and Su, Barbara and Li, Jianqiang and Kyrillidis, Anastasios},
  journal = {arXiv preprint arXiv:2602.20376},
  year    = {2025}
}