In the Rice University Data Center, there sit four machines that most researchers would rather forget about. Between them they hold 15 NVIDIA P100 GPUs — the Pascal-generation accelerators that launched in 2016. No tensor cores. No BF16 support. Two of the machines run PowerPC ppc64le, meaning half the Python ecosystem quietly refuses to install. The newest GPU in this collection is nearly a decade old.
Our question was simple: can these obsolete machines, working together, solve hard combinatorial optimization problems competitively? The answer turns out to be yes — but only if you pick an algorithm that matches what the hardware can actually do. Instead of running a monolithic SDP solver on a single expensive node, we exploit the low-rank structure of quadratic problems to decompose the problem into millions of independent candidate evaluations. Each evaluation is a tiny matrix operation — exactly the workload a P100 was designed for.
| Machine | GPUs | Architecture | RAM | Notes |
|---|---|---|---|---|
| anton.k0 | 4 × P100-SXM2 (16 GB) | x86_64 | 251 GB | Coordinator node |
| anton.k1 | 4 × P100-SXM2 (16 GB) | x86_64 | 125 GB | Reliable |
| kp001 | 4 × Tesla P100 (16 GB) | ppc64le | 256 GB | PowerPC — limited package support |
| kp002 | 3 × Tesla P100 (16 GB) | ppc64le | 256 GB | Salvage machine, power-capped at 220 W |
The result: our Rank-2 algorithm running across all 15 GPUs matches or exceeds SDP solution quality on quadratic form maximization, as in Max-Cut instances, while running 1.5–3× faster. On torus graphs, it finds the provably optimal cut value every single time. And the total hardware cost? These machines were headed for e-waste.
Given an undirected graph \(G = (V, E)\) with \(n\) vertices, the Max-3-Cut problem asks: partition the vertices into three groups \(S_1, S_2, S_3\) to maximize the number of edges with endpoints in different groups. If you think of each group as a color, you want to color vertices so that as many edges as possible connect differently-colored vertices.
This is equivalent to maximizing the Hamiltonian

\[
H(z) = z^\dagger Q z = \frac{2}{3} \sum_{(i,j) \in E} \bigl(1 - \text{Re}(\bar{z}_i z_j)\bigr),
\]

where \(\omega = e^{2\pi i/3}\) is a primitive cube root of unity, and \(Q\) is derived from the graph Laplacian \(L\):

\[
Q = \frac{1}{3} L = \frac{1}{3}(D - A).
\]

Here \(A\) is the adjacency matrix and \(D\) is the degree matrix. Each vertex \(i\) is assigned a label \(z_i \in \{1, \omega, \omega^2\}\) corresponding to one of three groups. When two adjacent vertices have different labels, \(\text{Re}(\bar{z}_i z_j) = -\frac{1}{2}\), so their contribution \(\frac{2}{3}(1 - \text{Re}(\bar{z}_i z_j))\) equals 1; when the labels agree it equals 0. The Hamiltonian therefore counts exactly the edges cut by the assignment.
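To make the identity concrete, here is a small numpy check, assuming the normalization \(Q = \frac{1}{3}(D - A)\) under which \(z^\dagger Q z\) counts cut edges (the function names `cut_value` and `hamiltonian` are ours, not from the codebase):

```python
import numpy as np

omega = np.exp(2j * np.pi / 3)  # primitive cube root of unity

def cut_value(A, labels):
    """Directly count edges whose endpoints land in different groups."""
    n = A.shape[0]
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if A[i, j] and labels[i] != labels[j])

def hamiltonian(A, labels):
    """Evaluate z^dagger Q z with Q = (D - A)/3 and z_i = omega**labels[i]."""
    z = omega ** np.array(labels)
    Q = (np.diag(A.sum(axis=1)) - A) / 3.0
    return float((z.conj() @ Q @ z).real)

# Triangle graph: a proper 3-coloring cuts all three edges.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
assert abs(hamiltonian(A, [0, 1, 2]) - cut_value(A, [0, 1, 2])) < 1e-9
```

The two quantities agree on every labeling, which is exactly why maximizing the Hamiltonian maximizes the cut.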
Max-3-Cut is NP-hard. The best polynomial-time approximation ratio known is 0.836, achieved via semidefinite programming (SDP) relaxation followed by hyperplane rounding. The SDP approach lifts the problem into a space of \(n \times n\) positive semidefinite matrices, solves a convex relaxation, and then rounds the solution back to a discrete assignment. This gives excellent solution quality, but the cost is steep: interior-point SDP solvers run in \(\mathcal{O}(n^{3.5})\) time and require \(\mathcal{O}(n^2)\) memory, and they are inherently sequential.
Max-\(K\)-Cut problems arise throughout engineering and science. In VLSI circuit design, partitioning circuit components minimizes inter-partition wiring. In machine learning, spectral clustering reduces to a relaxed Max-\(K\)-Cut. In quantum computing, Max-3-Cut is a standard benchmark for the Quantum Approximate Optimization Algorithm (QAOA), and classical solvers provide the baselines that quantum devices must beat. Finding faster classical methods raises the bar for quantum advantage claims.
Let \(Q = \sum_{j=1}^n \lambda_j v_j v_j^\dagger\) be the eigendecomposition of the objective matrix, with eigenvalues sorted in decreasing order. If the top \(r\) eigenvalues capture most of the spectral energy — that is, \(\sum_{j=1}^r \lambda_j \approx \sum_{j=1}^n \lambda_j\) — then we can approximate \(Q \approx Q_r = \sum_{j=1}^r \lambda_j v_j v_j^\dagger\) and solve the optimization over the low-rank approximation instead.
When \(r = 1\), the objective becomes \(z^\dagger (\lambda_1 v_1 v_1^\dagger) z = \lambda_1 |v_1^\dagger z|^2\). Each vertex \(i\) contributes through its eigenvector component \(v_{1,i}\), which has some angle \(\theta_i\) in the complex plane. The optimal assignment is determined by how a global phase shift \(\varphi\) aligns these angles with the three sectors corresponding to \(\{1, \omega, \omega^2\}\).
The algorithm sweeps \(\varphi\) over a discrete set of \(n+1\) candidate values — one for each angle boundary where a vertex's assignment could change. For each candidate \(\varphi\), we evaluate the cut value in \(O(n)\) time. The total cost is \(O(n^2)\) for the sweep plus \(O(n)\) for the eigenvector computation (via Lanczos iteration).
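A minimal sketch of the sweep, under a simplified reading (snap each rotated eigenvector angle to the nearest third-of-circle sector and score against the full \(Q\)); this is illustrative, not the implementation's exact candidate construction:

```python
import numpy as np

def rank1_sweep(Q, v1):
    """Rank-1 sweep sketch: rotate the top eigenvector's angles by a global
    phase phi, assign each vertex the cube-root sector its rotated angle
    falls into, and keep the best-scoring assignment."""
    theta = np.angle(v1.astype(complex))
    # n + 1 candidate phases: phi = 0 plus, per vertex, the boundary phase
    # at which that vertex's sector assignment flips.
    candidates = np.concatenate([[0.0], (np.pi / 3 - theta) % (2 * np.pi / 3)])
    best_val, best_z = -np.inf, None
    for phi in candidates:
        sectors = np.floor(((theta + phi) % (2 * np.pi)) / (2 * np.pi / 3))
        z = np.exp(2j * np.pi * sectors / 3)  # entries in {1, w, w^2}
        val = float((z.conj() @ Q @ z).real)  # O(n) with a sparse Q
        if val > best_val:
            best_val, best_z = val, z
    return best_val, best_z
```

Each of the \(n+1\) candidates costs one assignment and one evaluation, giving the \(O(n^2)\) sweep described above.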
When \(r = 2\), the objective is \(z^\dagger Q_2 z = \lambda_1 |v_1^\dagger z|^2 + \lambda_2 |v_2^\dagger z|^2\). Now each vertex \(i\) is described by a pair of angles \((\theta_{1,i}, \theta_{2,i})\) from the two eigenvectors. The decision boundaries become curves in a two-dimensional parameter space rather than points on a circle.
The structure is embarrassingly parallel: each candidate can be evaluated independently. With \(P\) processors, the wall-clock time drops by a factor of \(P\). And crucially, the algorithm is exact for rank-2 matrices — the candidate set is guaranteed to contain the global maximizer.
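The per-candidate work batches into plain matrix-vector products, which is what makes GPU evaluation natural. A CPU-side numpy sketch of the batched scorer (the real workers run the torch equivalent on CUDA; `score_batch` is our illustrative name):

```python
import numpy as np

def score_batch(Z, v1, v2, lam1, lam2):
    """Score a batch of candidate assignments against the rank-2 objective
        z^H Q_2 z = lam1 |v1^H z|^2 + lam2 |v2^H z|^2.
    Z is a (batch, n) complex array, one candidate assignment per row; the
    whole batch reduces to two matrix-vector products."""
    p1 = Z @ v1.conj()  # batch of inner products v1^H z
    p2 = Z @ v2.conj()
    return lam1 * np.abs(p1) ** 2 + lam2 * np.abs(p2) ** 2
```

A worker then just keeps `np.argmax` over its assigned batches; no candidate ever needs to see another candidate's result.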
The quality of the low-rank approximation depends on the spectral energy ratio \(\rho_r = 1 - \sum_{j=1}^r \lambda_j / \sum_{j=1}^n \lambda_j\), which measures the fraction of spectral energy not captured by the top \(r\) eigenvalues. Our theory guarantees a multiplicative approximation factor of \((1 - \epsilon)\) where \(\epsilon\) depends on \(\rho_r\) and the perturbation structure.
The Rank-2 algorithm generates a polynomial number of candidate solutions that must each be evaluated. On a single CPU core this is prohibitive: on the order of years for graphs with \(n = 1000\) nodes. But each evaluation is a dot product of length \(n\), and dot products are exactly what GPUs were built for. The challenge is not the computation per candidate; it is orchestrating millions of parallel evaluations across 15 GPUs on 4 machines connected by a campus network.
Our first instinct was to use Ray for distributed task scheduling. Ray is elegant and well-documented — and it does not build on ppc64le (the PowerPC architecture running on kp001 and kp002). We also considered Triton for custom GPU kernels, but Triton requires CUDA compute capability 7.0+, and the P100 is stuck at 6.0. We were left with the most portable tool in the PyTorch ecosystem: torch.multiprocessing.
Each machine runs a worker.py process that receives its assigned slice of the candidate space via SSH, computes the cut values on its local GPUs using batched matrix operations, and writes the best result to a shared NFS directory. The coordinator on anton.k0 launches workers, monitors output files, and aggregates results. There is no inter-worker communication during execution — pure embarrassing parallelism.
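A sketch of the file-based protocol, with hypothetical helper names (`write_result`, `aggregate`); the actual worker.py and coordinator code may differ in detail:

```python
import json
import os
import tempfile

def write_result(outdir, worker_id, best_score, best_candidate):
    """Publish a worker's best result to the shared NFS directory.
    Write to a temp file first, then rename: os.replace is atomic on POSIX
    filesystems, so the coordinator never reads a half-written file."""
    payload = {"worker": worker_id, "score": best_score,
               "candidate": best_candidate}
    fd, tmp = tempfile.mkstemp(dir=outdir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, os.path.join(outdir, f"worker_{worker_id}.json"))

def aggregate(outdir):
    """Coordinator side: take the max over all published worker results."""
    results = []
    for name in os.listdir(outdir):
        if name.endswith(".json"):
            with open(os.path.join(outdir, name)) as f:
                results.append(json.load(f))
    return max(results, key=lambda r: r["score"])
```

Because the reduction is a single max, the coordinator's work is trivial regardless of how many workers participate.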
The naive approach is to split candidates equally: each of the 15 GPUs gets 1/15 of the work. But our GPUs are not equal. The x86 machines (anton.k0, anton.k1) achieve roughly 20% higher throughput than the PowerPC machines (kp001, kp002), partly due to faster host-device transfer and partly due to better-optimized CUDA libraries. Equal splitting means the x86 machines finish early and sit idle while the PowerPC machines grind through their oversized share.
The fix is simple: benchmark each GPU with a small calibration batch (1000 candidates, ~2 seconds) before the main run, then split work proportional to measured throughput. This one change reduced our wall-clock time by 10.3× on a 750-node instance — the difference between 53 minutes and 5.1 minutes. The interactive diagram below shows the effect.
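The split itself reduces to a few lines; this is a sketch of the idea, not the project's exact code:

```python
def proportional_split(total, throughputs):
    """Split `total` candidates across workers in proportion to measured
    throughput (candidates/sec from each GPU's calibration batch), rounding
    so the shares sum exactly to `total`."""
    total_tp = sum(throughputs)
    shares = [int(total * t / total_tp) for t in throughputs]
    shares[-1] += total - sum(shares)  # hand the rounding remainder to one worker
    return shares
```

With throughputs measured per GPU, the faster x86 cards automatically receive proportionally larger slices, so all 15 GPUs finish at roughly the same time.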
We tested four methods on three graph families (3-regular, torus lattice, stochastic block model) at six sizes (\(n \in \{250, 500, 750, 1000, 1250, 1500\}\)), with three random seeds each — 54 instances in total. The methods are Rank-2 (ours), Greedy, Random assignment, and SDP with hyperplane rounding.
All Rank-2 runs used throughput-proportional splitting across 15 GPUs.
| Family | n | Seed | Rank-2 Score | Rank-2 Time (s) | Greedy Score | Greedy Time (s) | Random Score | SDP Score | SDP Time (s) |
|---|---|---|---|---|---|---|---|---|---|
Torus graphs: Rank-2 matches SDP perfectly at every size tested, consistently achieving the optimal cut value. This is expected — torus Laplacians have extremely concentrated spectra (top-2 eigenvalues capture >99% of energy), so the rank-2 approximation is nearly exact. At \(n = 1008\), Rank-2 on 15 GPUs runs in 1,535s compared to SDP's 1,581s — comparable speed but with the advantage of easy parallelization to more machines.
3-Regular graphs: Rank-2 beats or matches SDP on instances with \(n \geq 1000\). At \(n = 1000\), the average Rank-2 score (7,043) exceeds the average SDP score (7,030). The gap widens at \(n = 1500\), where Rank-2 averages 10,548 versus SDP's 10,494 — a modest but consistent advantage. This is noteworthy because regular graphs are not as spectrally concentrated as torus graphs, yet the low-rank approximation still captures enough structure to be competitive.
SBM (Stochastic Block Model): Here we must be honest — Rank-2 underperforms. On SBM graphs, Greedy consistently achieves the highest cut values, followed by SDP, with Rank-2 typically 5–7% below the best. The reason is clear from the spectral energy ratios: SBM graphs have relatively flat spectra, so the top-2 eigenvalues capture only a small fraction of the total energy. Low-rank methods simply cannot see enough of the graph structure.
At small sizes (\(n \leq 500\)), SDP is faster than Rank-2 because Rank-2's \(\mathcal{O}(n^5/P)\) candidate cost dominates even with 15 GPUs. The quality crossover happens around \(n = 750\text{--}1000\) for regular graphs. On runtime, SDP still leads at the largest sizes: at \(n = 1500\), Rank-2 takes about 7,100 seconds on regular graphs versus SDP's roughly 4,400 seconds. However, Rank-2 scales linearly with additional GPUs: another ten P100s would bring it below SDP's time, and the algorithm can distribute across any number of machines with no communication overhead.
Everything above uses 2016-era P100 GPUs — the "obsolete hardware" story. But the same code, without any modifications, also runs on modern NVIDIA H200 GPUs. This validates that the software engineering is hardware-agnostic: the same worker.py and parallel_rank_r_dir_gpu_fullgpu.py execute on GPUs spanning three generations and two CPU architectures.
The table below shows rank-2 results on a single H200 node. All experiments use the same recursive GPU solver, K=3, and 32-bit precision. The GSet benchmark instances are standard testbeds for Max-Cut algorithms; here we solve the Max-3-Cut formulation on these graphs.
| Instance | \(n\) | GPUs | Avg. batch (s) | Total time | Score (Max-3-Cut) |
|---|---|---|---|---|---|
| GSet-1 | 800 | 4×H200 | 46.5 | 5.6 min | 41,939 |
| GSet-1 | 800 | 6×H200 | 41.1 | 3.4 min | 41,939 |
| GSet-43 | 1,000 | 6×H200 | 49.5 | 6.5 min | 22,965 |
| GSet-22 | 2,000 | 6×H200 | 84.3 | 1.4 hrs | 45,791 |
| 5-regular | 3,600 | 8×H200 | 242.0 | 18.3 hrs | 25,320 |
Rank-2 Max-3-Cut results on H200 GPUs. GSet scores are for the K=3 formulation and are not comparable to published Max-2-Cut (BQP) values, which optimize a different objective.
The standout result is the 5-regular graph with \(n=3{,}600\) on 8 H200 GPUs: 18.3 hours to enumerate all 209 billion rank-2 candidate tuples. This establishes \(n \approx 3{,}600\) as the practical frontier for exact rank-2 enumeration within a 24-hour budget on a single modern GPU node.
How do H200s compare to our P100 cluster? At \(n=1{,}000\), 6 H200s complete the rank-2 search in 392 seconds, compared to 625 seconds on 15 P100s. Per-GPU, each H200 is roughly 4× faster — consistent with its higher memory bandwidth and FP32 throughput. The P100 story is not about per-GPU competitiveness; it's about accessibility. Universities have racks of P100s sitting idle. Our algorithm turns them into a useful combinatorial optimization cluster without needing modern hardware.
The rank-3 algorithm (\(\mathcal{O}(n^7)\) candidates) is validated at small scale on H200s: \(n=50\) completes in 24 seconds, \(n=100\) in 381 seconds, both producing correct optimal scores. However, scaling is severe — extrapolating from observed rates, \(n=1{,}000\) at rank-3 would require approximately 160,000 hours, placing full rank-3 enumeration far beyond any practical budget. This motivates the randomized sampling direction: even partial enumeration of rank-3 candidates can find high-quality solutions (an early checkpoint at \(n=1{,}000\) achieved 86.6% of the rank-2 score after processing less than 1% of candidates).
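A sketch of what that sampling loop could look like (`sampled_best` is a hypothetical helper; the randomized variant is future work, not something in the current codebase):

```python
import numpy as np

def sampled_best(score_fn, num_candidates, budget, seed=0):
    """Score a random subset of the candidate space instead of enumerating
    all of it. Sampling is with replacement, so it never materializes the
    full index range; duplicates are harmless at these scales.
    score_fn maps a candidate index to its objective value."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, num_candidates, size=budget)
    scores = np.array([score_fn(i) for i in idx])
    b = int(np.argmax(scores))
    return int(idx[b]), float(scores[b])
```

The same distributed machinery applies unchanged: each worker draws its own random indices, scores them, and publishes its best, so the budget splits across GPUs exactly like full enumeration.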
Every method has its strengths and weaknesses. Rather than overselling our approach, we present a balanced comparison across five dimensions: solution quality, speed, parallelizability, scalability to large graphs, and theoretical guarantees. The radar chart below visualizes these trade-offs.
Use Rank-2 when: your graph has concentrated spectral energy (torus lattices, regular graphs, structured networks), you have access to multiple GPUs (even old ones), you need provable guarantees, and your graph is large enough (\(n \geq 500\)) that SDP's sequential nature becomes a bottleneck.
Use SDP when: you need the best known approximation ratio regardless of runtime, your graph is small (\(n \leq 500\)), or the graph has flat spectral structure (e.g., SBM, Erdős–Rényi) where low-rank methods lose information.
Use Greedy when: you need a quick, good-enough answer. Greedy is surprisingly strong on SBM graphs, often outperforming SDP. Its weakness is the lack of theoretical guarantees and poor performance on structured graphs where the optimal cut has a specific algebraic structure that Greedy's local moves cannot discover.
The \(\mathcal{O}(n^5)\) asymptotic complexity of Rank-2 is the elephant in the room. While parallelism hides this for moderate \(n\), at \(n = 2000\) the candidate count reaches \(\sim 10^{13}\), which would require hundreds of GPUs. Rank-3 (\(\mathcal{O}(n^7)\)) is even worse. Future work on randomized candidate sampling — evaluating only a random subset of candidates — could dramatically reduce this cost while maintaining solution quality with high probability.
The second limitation is graph-family dependence. Low-rank methods are not universal: they excel on spectrally concentrated graphs and struggle on graphs with flat spectra. A practitioner must check the spectral energy ratio before deciding whether to use this approach. Fortunately, this check (computing a few top eigenvalues) is itself fast and parallelizable.
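That pre-flight check might look like this (a dense-matrix sketch; for large sparse graphs one would use an iterative top-\(r\) eigensolver such as Lanczos instead of the full decomposition):

```python
import numpy as np

def spectral_energy_ratio(Q, r):
    """rho_r = 1 - (sum of top-r eigenvalues) / (sum of all eigenvalues).
    Small rho_r means a rank-r approximation captures most of the spectral
    energy and the low-rank method is likely to work well."""
    w = np.sort(np.linalg.eigvalsh(Q))[::-1]  # eigenvalues, descending
    return float(1.0 - w[:r].sum() / w.sum())
```

For Laplacian-derived \(Q\) the eigenvalues are nonnegative, so the ratio lands in \([0, 1]\); a practitioner can compute it in seconds before committing GPU-hours.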
All code, data, and scripts to reproduce every experiment in this post are available on GitHub:
github.com/barbara-su/max-k-cut-parallel
```bash
# Clone and install
git clone https://github.com/barbara-su/max-k-cut-parallel.git
cd max-k-cut-parallel
pip install -r requirements.txt

# Run Rank-2 on a single GPU (torus, n=252)
python run_rank2.py --family torus --n 252 --gpus 1

# Run distributed across multiple machines
python coordinator.py --config cluster.yaml --family regular --n 1000
```