What Can 15 Obsolete GPUs Do for Combinatorial Optimization?

Exploiting low-rank structure in Max-3-Cut with commodity parallel hardware

Ria Stevens, Fangshuo Liao, Barbara Su, Jianqiang Li, Anastasios Kyrillidis — OptimaLab

Computer Science Department, Rice University

March 2026

Paper Code

1. The Hook: 15 GPUs Nobody Wants

In the Rice University Data Center, there sit four machines that most researchers would rather forget about. Between them they hold 15 NVIDIA P100 GPUs — the Pascal-generation accelerators that launched in 2016. No tensor cores. No BF16 support. Two of the machines run PowerPC ppc64le, meaning half the Python ecosystem quietly refuses to install. The newest GPU in this collection is nearly a decade old.

Our question was simple: can these obsolete machines, working together, solve hard combinatorial optimization problems competitively? The answer turns out to be yes — but only if you pick an algorithm that matches what the hardware can actually do. Instead of running a monolithic SDP solver on a single expensive node, we exploit the low-rank structure of quadratic problems to decompose the problem into millions of independent candidate evaluations. Each evaluation is a tiny matrix operation — exactly the workload a P100 was designed for.

Machine	GPUs	Architecture	RAM	Notes
`anton.k0`	4 × P100-SXM2 (16 GB)	x86_64	251 GB	Coordinator node
`anton.k1`	4 × P100-SXM2 (16 GB)	x86_64	125 GB	Reliable
`kp001`	4 × Tesla P100 (16 GB)	ppc64le	256 GB	PowerPC — limited package support
`kp002`	3 × Tesla P100 (16 GB)	ppc64le	256 GB	Salvage machine, power-capped at 220W

The result: our Rank-2 algorithm running across all 15 GPUs matches or exceeds SDP solution quality on quadratic form maximization, as in MaxCUT instances, while running 1.5–3× faster on most instances tested (torus at all sizes, regular graphs up to $n = 1000$). On torus graphs, it finds the provably optimal cut value every single time. And the total hardware cost? These machines were headed for e-waste.

2. The Max-3-Cut Problem

Given an undirected graph $G = (V, E)$ with $n$ vertices, the Max-3-Cut problem asks: partition the vertices into three groups $S_1, S_2, S_3$ to maximize the number of edges with endpoints in different groups. If you think of each group as a color, you want to color vertices so that as many edges as possible connect differently-colored vertices.

For Max-2-Cut, the two groups are encoded as $+1$ and $-1$ (the square roots of unity); for Max-3-Cut, we generalize to $\{1, \omega, \omega^2\}$, the cube roots of unity. This complex-valued encoding turns the combinatorial problem into a maximization over phases.

Formally, the problem is equivalent to maximizing the Hamiltonian

\text{Max-3-Cut}(G) = \max_{z \in \{1, \omega, \omega^2\}^n} \; \frac{1}{3} \, z^\dagger Q \, z

where $\omega = e^{2\pi i/3}$ is a primitive cube root of unity, and $Q$ is derived from the graph Laplacian $L$:

Q = \frac{3}{2}\left(D - \frac{1}{3}A - \frac{2}{3}D\right) = D - \frac{1}{2}A

Here $A$ is the adjacency matrix and $D$ is the degree matrix. Each vertex $i$ is assigned a label $z_i \in \{1, \omega, \omega^2\}$ corresponding to one of three groups. When two adjacent vertices have different labels, their contribution $\frac{1}{3}(1 - \text{Re}(\bar{z}_i z_j))$ equals 1 if they are in different groups and 0 if they are in the same group.

Why is it hard?

Max-3-Cut is NP-hard. The best polynomial-time approximation ratio known is 0.836, due to Frieze and Jerrum's complex-valued SDP generalization. (For comparison, the celebrated Goemans–Williamson result achieves 0.878 for $K = 2$, i.e., Max-Cut.) This ratio is achieved via semidefinite programming (SDP) relaxation followed by hyperplane rounding. The SDP approach lifts the problem into a space of $n \times n$ positive semidefinite matrices, solves a convex relaxation, and then rounds the solution back to a discrete assignment. This gives excellent solution quality, but the cost is steep: interior-point SDP solvers run in $\mathcal{O}(n^{3.5})$ time and require $\mathcal{O}(n^2)$ memory, and they are inherently sequential.

Why does it matter?

Max-$K$-Cut problems arise throughout engineering and science. In VLSI circuit design, partitioning circuit components minimizes inter-partition wiring. In machine learning, spectral clustering reduces to a relaxed Max-$K$-Cut. In quantum computing, Max-3-Cut is a standard benchmark for the Quantum Approximate Optimization Algorithm (QAOA), and classical solvers provide the baselines that quantum devices must beat. Finding faster classical methods raises the bar for quantum advantage claims.

3. The Low-Rank Idea

Why low-rank works: spectral concentration in real graphs

The key empirical observation is that real-world and structured graphs have spectra where a handful of top eigenvalues carry most of the cut-relevant information. Torus lattices, regular graphs, and many network models exhibit rapid eigenvalue decay: the largest two or three eigenvalues of the objective matrix $Q$ dominate the sum, while the remaining eigenvalues contribute relatively little to the optimal cut value. This spectral concentration means that solving the optimization over a low-rank approximation of $Q$ can yield solutions close to the true optimum — at dramatically lower computational cost.

Exploiting the structure: eigendecomposition

Let $Q = \sum_{j=1}^n \lambda_j v_j v_j^\dagger$ be the eigendecomposition of the objective matrix, with eigenvalues sorted in decreasing order. If the top $r$ eigenvalues capture most of the spectral energy — that is, $\sum_{j=1}^r \lambda_j \approx \sum_{j=1}^n \lambda_j$ — then we can approximate $Q \approx Q_r = \sum_{j=1}^r \lambda_j v_j v_j^\dagger$ and solve the optimization over the low-rank approximation instead.

Rank-1: Phase sweep

When $r = 1$, the objective becomes $z^\dagger (\lambda_1 v_1 v_1^\dagger) z = \lambda_1 |v_1^\dagger z|^2$. Each vertex $i$ contributes through its eigenvector component $v_{1,i}$, which has some angle $\theta_i$ in the complex plane. The optimal assignment is determined by how a global phase shift $\varphi$ aligns these angles with the three sectors corresponding to $\{1, \omega, \omega^2\}$.

The algorithm sweeps $\varphi$ over a discrete set of $n+1$ candidate values — one for each angle boundary where a vertex's assignment could change. For each candidate $\varphi$, we evaluate the cut value in $O(n)$ time. The total cost is $O(n^2)$ for the sweep plus $O(n)$ for the eigenvector computation (via Lanczos iteration).

Rank-2: Hypersurface arrangements

When $r = 2$, the objective is $z^\dagger Q_2 z = \lambda_1 |v_1^\dagger z|^2 + \lambda_2 |v_2^\dagger z|^2$. Now each vertex $i$ is described by a pair of angles $(\theta_{1,i}, \theta_{2,i})$ from the two eigenvectors. The decision boundaries become curves in a two-dimensional parameter space rather than points on a circle. In rank-1, the rounding depends on a single phase angle $\varphi$. In rank-2, the rounding depends on a direction $(\alpha, \beta)$ on a 3-dimensional sphere in $\mathbb{C}^2$, giving exponentially more candidate partitions to search.

The structure is embarrassingly parallel: each candidate can be evaluated independently. With $P$ processors, the wall-clock time drops by dividing the work by $1/P$. And crucially, the algorithm is exact for rank-2 matrices — the candidate set is guaranteed to contain the global maximizer.

When does low-rank work?

The quality of the low-rank approximation depends on the spectral energy ratio $\rho_r = 1 - \sum_{j=1}^r \lambda_j / \sum_{j=1}^n \lambda_j$, which measures the fraction of spectral energy not captured by the top $r$ eigenvalues. Our theory guarantees a multiplicative approximation factor of $(1 - \epsilon)$ where $\epsilon$ depends on $\rho_r$ and the perturbation structure.

When does low-rank capture enough? It depends entirely on how fast the eigenvalues decay. The thermometer below illustrates three representative spectral profiles. A nearly flat spectrum (like dense random graphs) spreads energy across many eigenvalues — low-rank captures little. A power-law decay concentrates more energy in the top few. An exactly low-rank spectrum (three dominant eigenvalues, rest negligible) is the ideal case: rank-3 captures essentially everything. Real graphs fall somewhere on this continuum. Torus and regular graphs are closer to the low-rank end; Erdős–Rényi and SBM are closer to the flat end.

Interactive: Spectral Energy Thermometer

Rank $r$: r = 2

Three illustrative spectral profiles. Blue bars are included at the current rank $r$; gray bars are excluded. The energy percentage shows how much of the total spectral energy is captured. Move the rank slider to see how quickly each profile saturates.

4. Engineering for Heterogeneous Hardware

The Rank-2 algorithm generates a polynomial number of candidate solutions that must each be evaluated. This number would take a single CPU core a lot of time, sometimes a couple of years even for graphs of $n = 1000$ nodes. But each evaluation is a dot product of length $n$, and dot products are exactly what GPUs were built for. The challenge is not the computation per candidate; it is orchestrating millions of parallel evaluations across 15 GPUs on 4 machines connected by a campus network.

Why not Ray? Why not Triton?

Our first instinct was to use Ray for distributed task scheduling. Ray is elegant and well-documented — and it does not build on ppc64le (the PowerPC architecture running on kp001 and kp002). We also considered Triton for custom GPU kernels, but Triton requires CUDA compute capability 7.0+, and the P100 is stuck at 6.0. We were left with the most portable tool in the PyTorch ecosystem: torch.multiprocessing.

The worker design

Each machine runs a worker.py process that receives its assigned slice of the candidate space via SSH, computes the cut values on its local GPUs using batched matrix operations, and writes the best result to a shared NFS directory. The coordinator on anton.k0 launches workers, monitors output files, and aggregates results. There is no inter-worker communication during execution — pure embarrassing parallelism.

Throughput-proportional splitting

The naive approach is to split candidates equally: each of the 15 GPUs gets 1/15 of the work. But our GPUs are not equal. The x86 machines (anton.k0, anton.k1) achieve roughly 20% higher throughput than the PowerPC machines (kp001, kp002), partly due to faster host-device transfer and partly due to better-optimized CUDA libraries. Equal splitting means the x86 machines finish early and sit idle while the PowerPC machines grind through their oversized share.

The fix is simple: benchmark each GPU with a small calibration batch (1000 candidates, ~2 seconds) before the main run, then split work proportional to measured throughput. This one change reduced our wall-clock time by 10.3× on a 750-node instance — the difference between 53 minutes and 5.1 minutes. The interactive diagram below shows the effect.

Interactive: GPU Cluster Work Distribution

Click the button to toggle between equal and throughput-proportional work distribution. Red borders indicate bottleneck machines. Progress bars show relative workload.

5. Results

We tested four methods on three graph families (3-regular, torus lattice, stochastic block model) at six sizes ($n \in \{250, 500, 750, 1000, 1250, 1500\}$), with three random seeds each — 54 instances in total. The methods are:

Rank-2 (15 GPU) — our low-rank algorithm on all 15 P100s
Greedy — iterative improvement starting from a random assignment
Random — best of 10,000 uniform random assignments
SDP — CVXPY with the SCS solver, run on a single x86 machine

All Rank-2 runs used throughput-proportional splitting across 15 GPUs.

Sortable Results Table

Family	n	Seed	Rank-2 Score	Rank-2 Time (s)	Greedy Score	Greedy Time (s)	Random Score	SDP Score	SDP Time (s)

Click column headers to sort. Use the filter buttons to show a single graph family. Green cells indicate the highest score for each instance. Light blue cells highlight the faster method between Rank-2 and SDP.

Key findings

Torus graphs: Rank-2 matches SDP perfectly at every size tested, consistently achieving the optimal cut value. This is expected — torus Laplacians have extremely concentrated spectra (top-2 eigenvalues capture >99% of energy), so the rank-2 approximation is nearly exact. At $n = 1008$, Rank-2 on 15 GPUs runs in 1,535s compared to SDP's 1,581s — comparable speed but with the advantage of easy parallelization to more machines.

3-Regular graphs: Rank-2 beats or matches SDP on instances with $n \geq 1000$. At $n = 1000$, the average Rank-2 score (7,043) exceeds the average SDP score (7,030). The gap widens at $n = 1500$, where Rank-2 averages 10,548 versus SDP's 10,494 — a modest but consistent advantage. This is noteworthy because regular graphs are not as spectrally concentrated as torus graphs, yet the low-rank approximation still captures enough structure to be competitive.

SBM (Stochastic Block Model): Here we must be honest — Rank-2 underperforms. On SBM graphs, Greedy consistently achieves the highest cut values, followed by SDP, with Rank-2 typically 5–7% below the best. The reason is clear from the spectral energy ratios: SBM graphs have relatively flat spectra, so the top-2 eigenvalues capture only a small fraction of the total energy. Low-rank methods simply cannot see enough of the graph structure.

Interactive: Runtime Scaling (Log-Log)

Log-log plot of runtime vs. graph size, averaged across seeds. Toggle lines on/off with the checkboxes. Hover for exact values.

Timing comparison

At small and medium sizes ($n \leq 1000$), Rank-2 is faster than SDP on regular graphs: at $n = 250$ it is 1.3–2× faster, at $n = 500$ roughly 3× faster, and at $n = 1000$ about 1.7× faster. However, the $\mathcal{O}(n^3)$ candidate enumeration cost catches up at larger sizes. At $n = 1500$ on regular graphs, Rank-2 takes about 7,100 seconds while SDP takes about 4,400 seconds — the crossover happens between $n = 1000$ and $n = 1500$. On torus graphs, Rank-2 remains competitive or faster at all sizes tested. Importantly, Rank-2 scales linearly with additional GPUs: adding just 5 more P100s would bring the $n = 1500$ regular time below SDP's, and the algorithm can distribute across any number of machines with no communication overhead.

Scaling to modern hardware

Everything above uses 2016-era P100 GPUs — the "obsolete hardware" story. But the same code, without any modifications, also runs on modern NVIDIA H200 GPUs. This validates that the software engineering is hardware-agnostic: the same worker.py and parallel_rank_r_dir_gpu_fullgpu.py execute on GPUs spanning three generations and two CPU architectures.

The table below shows rank-2 results on a single H200 node. All experiments use the same recursive GPU solver, K=3, and 32-bit precision. The GSet benchmark instances are standard testbeds for Max-Cut algorithms; here we solve the Max-3-Cut formulation on these graphs.

Instance	$n$	GPUs	Avg. batch (s)	Total time	Score (Max-3-Cut)
GSet-1	800	4×H200	46.5	5.6 min	41,939
GSet-1	800	6×H200	41.1	3.4 min	41,939
GSet-43	1,000	6×H200	49.5	6.5 min	22,965
GSet-22	2,000	6×H200	84.3	1.4 hrs	45,791
5-regular	3,600	8×H200	242.0	18.3 hrs	25,320

Rank-2 Max-3-Cut results on H200 GPUs. GSet scores are for the K=3 formulation and are not comparable to published Max-2-Cut (BQP) values, which optimize a different objective.

The standout result is the 5-regular graph with $n=3{,}600$ on 8 H200 GPUs: 18.3 hours to enumerate all 209 billion rank-2 candidate tuples. This establishes $n \approx 3{,}600$ as the practical frontier for exact rank-2 enumeration within a 24-hour budget on a single modern GPU node.

How do H200s compare to our P100 cluster? At $n=1{,}000$, 6 H200s complete the rank-2 search in 392 seconds, compared to 625 seconds on 15 P100s. Per-GPU, each H200 is roughly 4× faster — consistent with its higher memory bandwidth and FP32 throughput. The P100 story is not about per-GPU competitiveness; it's about accessibility. Universities have racks of P100s sitting idle. Our algorithm turns them into a useful combinatorial optimization cluster without needing modern hardware.

A glimpse at Rank-3

The rank-3 algorithm ($\mathcal{O}(n^7)$ candidates) is validated at small scale on H200s: $n=50$ completes in 24 seconds, $n=100$ in 381 seconds, both producing correct optimal scores. However, scaling is severe — extrapolating from observed rates, $n=1{,}000$ at rank-3 would require approximately 160,000 hours, placing full rank-3 enumeration far beyond any practical budget. This motivates the randomized sampling direction: even partial enumeration of rank-3 candidates can find high-quality solutions (an early checkpoint at $n=1{,}000$ achieved 86.6% of the rank-2 score after processing less than 1% of candidates).

6. Honest Assessment

Every method has its strengths and weaknesses. Rather than overselling our approach, we present a balanced comparison across five dimensions: solution quality, speed, parallelizability, scalability to large graphs, and theoretical guarantees. The radar chart below visualizes these trade-offs.

Interactive: Method Comparison Radar

Hover over a method name to highlight its polygon and dim others. Five axes: Quality (avg. cut ratio vs. best known), Speed (1/runtime), Parallelizability (speedup with more GPUs), Scalability (largest tractable n), Theory (approximation guarantee strength). Note: SDP provides an unconditional 0.836-approximation for any input matrix. Rank-2's guarantee is conditional—it bounds the approximation relative to the low-rank optimum, not the full-rank optimum, and depends on the spectral energy ratio.

When to use low-rank methods

Use Rank-2 when: your graph has concentrated spectral energy (torus lattices, regular graphs, structured networks), you have access to multiple GPUs (even old ones), and your graph is large enough ($n \geq 500$) that SDP's sequential nature becomes a bottleneck. Note: Rank-2's approximation guarantee is relative to the low-rank optimum; for unconditional guarantees, SDP is stronger.

Use SDP when: you need the best known approximation ratio regardless of runtime, your graph has flat spectral structure (e.g., SBM, Erdos-Renyi) where low-rank methods lose information, or $n$ is large enough ($n \geq 1500$ on regular graphs) that the $\mathcal{O}(n^3)$ candidate enumeration outweighs the parallelism advantage.

Use Greedy when: you need a quick, good-enough answer. Greedy is surprisingly strong on SBM graphs, often outperforming SDP. Its weakness is the lack of theoretical guarantees and poor performance on structured graphs where the optimal cut has a specific algebraic structure that Greedy's local moves cannot discover.

Limitations

The $\mathcal{O}(n^5)$ asymptotic complexity of Rank-2 is the elephant in the room. While parallelism hides this for moderate $n$, at $n = 2000$ the candidate count reaches $\sim 10^{13}$, which would require hundreds of GPUs. Rank-3 ($\mathcal{O}(n^7)$) is even worse. Future work on randomized candidate sampling — evaluating only a random subset of candidates — could dramatically reduce this cost while maintaining solution quality with high probability.

The second limitation is graph-family dependence. Low-rank methods are not universal: they excel on spectrally concentrated graphs and struggle on graphs with flat spectra. A practitioner must check the spectral energy ratio before deciding whether to use this approach. Fortunately, this check (computing a few top eigenvalues) is itself fast and parallelizable.

7. Code & Reproducibility

All code, data, and scripts to reproduce every experiment in this post are available on GitHub:

github.com/barbara-su/MaxKCutParallel

Quick start

# Clone and install
git clone https://github.com/barbara-su/MaxKCutParallel/tree/public-release.git
cd max-k-cut-parallel
pip install -r requirements.txt

# Run Rank-2 on a single GPU (torus, n=252)
python run_rank2.py --family torus --n 252 --gpus 1

# Run distributed across multiple machines
python coordinator.py --config cluster.yaml --family regular --n 1000

Hardware requirements

Minimum: 1 NVIDIA GPU with CUDA 10.2+ and 8 GB VRAM, PyTorch 1.12+
Recommended: 4+ GPUs for $n \geq 500$, 8+ GPUs for $n \geq 1000$
Commodity setup (this blog): 15 × P100 (16 GB) across 4 machines (2 × x86_64, 2 × ppc64le), connected via campus network
Modern setup: Single 8 × H200 node reaches $n = 3{,}600$ for rank-2 within 24 hours
No tensor cores required. Algorithm uses standard FP32 GEMM and linear algebra — works on any CUDA GPU from Pascal (2016) onward

Citation

@article{Stevens2025maxkcut, title = {Exploiting Low-Rank Structure in Max-$K$-Cut Problems}, author = {Stevens, Ria and Liao, Fangshuo and Su, Barbara and Li, Jianqiang and Kyrillidis, Anastasios}, journal = {arXiv preprint arXiv:2602.20376}, year = {2025} }