What Can 15 Obsolete GPUs Do for Combinatorial Optimization?

Exploiting low-rank structure in Max-3-Cut with commodity parallel hardware
Ria Stevens, Fangshuo Liao, Barbara Su, Jianqiang Li, Anastasios Kyrillidis
Computer Science Department, Rice University
March 2026

1. The Hook: 15 GPUs Nobody Wants

In the Rice University Data Center, there sit four machines that most researchers would rather forget about. Between them they hold 15 NVIDIA P100 GPUs — the Pascal-generation accelerators that launched in 2016. No tensor cores. No BF16 support. Two of the machines run PowerPC ppc64le, meaning half the Python ecosystem quietly refuses to install. The newest GPU in this collection is nearly a decade old.

Our question was simple: can these obsolete machines, working together, solve hard combinatorial optimization problems competitively? The answer turns out to be yes — but only if you pick an algorithm that matches what the hardware can actually do. Instead of running a monolithic SDP solver on a single expensive node, we exploit the low-rank structure of quadratic problems to decompose the problem into millions of independent candidate evaluations. Each evaluation is a tiny matrix operation — exactly the workload a P100 was designed for.

| Machine | GPUs | Architecture | RAM | Notes |
|---|---|---|---|---|
| anton.k0 | 4 × P100-SXM2 (16 GB) | x86_64 | 251 GB | Coordinator node |
| anton.k1 | 4 × P100-SXM2 (16 GB) | x86_64 | 125 GB | Reliable |
| kp001 | 4 × Tesla P100 (16 GB) | ppc64le | 256 GB | PowerPC — limited package support |
| kp002 | 3 × Tesla P100 (16 GB) | ppc64le | 256 GB | Salvage machine, power-capped at 220 W |

The result: our Rank-2 algorithm running across all 15 GPUs matches or exceeds SDP solution quality on quadratic-form maximization problems such as Max-3-Cut, while running 1.5–3× faster. On torus graphs, it finds the provably optimal cut value every single time. And the total hardware cost? These machines were headed for e-waste.


2. The Max-3-Cut Problem

Given an undirected graph \(G = (V, E)\) with \(n\) vertices, the Max-3-Cut problem asks: partition the vertices into three groups \(S_1, S_2, S_3\) to maximize the number of edges with endpoints in different groups. If you think of each group as a color, you want to color vertices so that as many edges as possible connect differently-colored vertices.

This is equivalent to maximizing the Hamiltonian

\[ \text{Max-3-Cut}(G) = \max_{z \in \{1, \omega, \omega^2\}^n} \; \frac{1}{3} \, z^\dagger Q \, z \]

where \(\omega = e^{2\pi i/3}\) is a primitive cube root of unity, and \(Q\) is the graph Laplacian:

\[ Q = L = D - A \]

Here \(A\) is the adjacency matrix and \(D\) is the degree matrix. Each vertex \(i\) is assigned a label \(z_i \in \{1, \omega, \omega^2\}\) corresponding to one of three groups. For an edge \((i,j)\), the per-edge contribution \(\frac{2}{3}\left(1 - \text{Re}(\bar{z}_i z_j)\right)\) equals 1 when the endpoints have different labels (since then \(\text{Re}(\bar{z}_i z_j) = -\frac{1}{2}\)) and 0 when they share a label; summing over all edges gives exactly \(\frac{1}{3} z^\dagger L z\).
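As a quick sanity check of this formulation, here is a small NumPy snippet (ours, not from the paper's codebase) comparing \(\frac{1}{3} z^\dagger L z\), with \(L = D - A\), against a direct count of cut edges for a random graph and a random 3-coloring:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Random symmetric 0/1 adjacency matrix with no self-loops
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T
D = np.diag(A.sum(axis=1))
L = D - A                                   # graph Laplacian

# Random 3-coloring encoded as cube roots of unity
omega = np.exp(2j * np.pi / 3)
labels = rng.integers(0, 3, size=n)
z = omega ** labels

# Hamiltonian value: (1/3) z^H L z
hamiltonian = (z.conj() @ L @ z).real / 3.0

# Direct count of edges whose endpoints get different labels
direct = sum(A[i, j] for i in range(n) for j in range(i + 1, n)
             if labels[i] != labels[j])

assert abs(hamiltonian - direct) < 1e-9
```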

Why is it hard?

Max-3-Cut is NP-hard. The best polynomial-time approximation ratio known is 0.836, achieved via semidefinite programming (SDP) relaxation followed by hyperplane rounding. The SDP approach lifts the problem into a space of \(n \times n\) positive semidefinite matrices, solves a convex relaxation, and then rounds the solution back to a discrete assignment. This gives excellent solution quality, but the cost is steep: interior-point SDP solvers run in \(\mathcal{O}(n^{3.5})\) time and require \(\mathcal{O}(n^2)\) memory, and they are inherently sequential.

Why does it matter?

Max-\(K\)-Cut problems arise throughout engineering and science. In VLSI circuit design, partitioning circuit components minimizes inter-partition wiring. In machine learning, spectral clustering reduces to a relaxed Max-\(K\)-Cut. In quantum computing, Max-3-Cut is a standard benchmark for the Quantum Approximate Optimization Algorithm (QAOA), and classical solvers provide the baselines that quantum devices must beat. Finding faster classical methods raises the bar for quantum advantage claims.


3. The Low-Rank Idea

Let \(Q = \sum_{j=1}^n \lambda_j v_j v_j^\dagger\) be the eigendecomposition of the objective matrix, with eigenvalues sorted in decreasing order. If the top \(r\) eigenvalues capture most of the spectral energy — that is, \(\sum_{j=1}^r \lambda_j \approx \sum_{j=1}^n \lambda_j\) — then we can approximate \(Q \approx Q_r = \sum_{j=1}^r \lambda_j v_j v_j^\dagger\) and solve the optimization over the low-rank approximation instead.
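The truncation itself is a one-liner from the eigendecomposition. A minimal NumPy sketch (ours; the function name is illustrative):

```python
import numpy as np

def top_r_approximation(Q, r):
    """Rank-r truncation Q_r = sum_{j<=r} lambda_j v_j v_j^H, keeping the
    r largest eigenvalues (decreasing order, as in the text)."""
    lam, V = np.linalg.eigh(Q)            # eigh returns ascending eigenvalues
    idx = np.argsort(lam)[::-1][:r]       # indices of the top-r eigenvalues
    return (V[:, idx] * lam[idx]) @ V[:, idx].conj().T

# Example: a full-rank truncation reproduces Q exactly
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Q = M + M.T                               # symmetric test matrix
assert np.allclose(top_r_approximation(Q, 6), Q)
```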

Rank-1: Phase sweep

When \(r = 1\), the objective becomes \(z^\dagger (\lambda_1 v_1 v_1^\dagger) z = \lambda_1 |v_1^\dagger z|^2\). Each vertex \(i\) contributes through its eigenvector component \(v_{1,i}\), which has some angle \(\theta_i\) in the complex plane. The optimal assignment is determined by how a global phase shift \(\varphi\) aligns these angles with the three sectors corresponding to \(\{1, \omega, \omega^2\}\).

The algorithm sweeps \(\varphi\) over a discrete set of \(n+1\) candidate values — one for each angle boundary where a vertex's assignment could change. For each candidate \(\varphi\), we evaluate the cut value in \(O(n)\) time, so the sweep costs \(O(n^2)\) overall, plus the cost of computing the top eigenvector (a few Lanczos iterations, each \(O(|E|)\) on a sparse graph).
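The sweep is short enough to sketch in NumPy (an illustrative reimplementation of the idea, not the paper's code): evaluate one test phase per cell of the circular arrangement, snapping every vertex to the cube root nearest to its shifted angle.

```python
import numpy as np

OMEGA = np.exp(2j * np.pi / 3)

def rank1_sweep(v):
    """Exactly maximize |v^H z|^2 over z in {1, w, w^2}^n.

    Sweeps a global phase phi over one test point per cell of the circular
    arrangement; inside a cell, each vertex's best label is the cube root
    nearest to angle(v_i) + phi."""
    theta = np.angle(v)
    # Boundaries where some vertex's nearest root changes (period 2*pi/3)
    bounds = np.sort((np.pi / 3 - theta) % (2 * np.pi / 3))
    mids = (bounds + np.roll(bounds, -1)) / 2                # cell midpoints
    mids[-1] = (bounds[-1] + bounds[0] + 2 * np.pi / 3) / 2  # wraparound cell
    best_val, best_z = -np.inf, None
    for phi in mids:
        # Snap each shifted angle to the nearest multiple of 2*pi/3
        k = np.round((theta + phi) / (2 * np.pi / 3)).astype(int) % 3
        z = OMEGA ** k
        val = abs(np.vdot(v, z)) ** 2
        if val > best_val:
            best_val, best_z = val, z
    return best_val, best_z

# Small example
rng = np.random.default_rng(1)
v = rng.standard_normal(6) + 1j * rng.standard_normal(6)
val, z = rank1_sweep(v)
```

Because the assignment is constant within each cell, evaluating one phase per cell suffices for exactness.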

Rank-2: Hypersurface arrangements

When \(r = 2\), the objective is \(z^\dagger Q_2 z = \lambda_1 |v_1^\dagger z|^2 + \lambda_2 |v_2^\dagger z|^2\). Now each vertex \(i\) is described by a pair of angles \((\theta_{1,i}, \theta_{2,i})\) from the two eigenvectors. The decision boundaries become curves in a two-dimensional parameter space rather than points on a circle.

The structure is embarrassingly parallel: each candidate can be evaluated independently. With \(P\) processors, the wall-clock time therefore drops by roughly a factor of \(P\). And crucially, the algorithm is exact for rank-2 matrices — the candidate set is guaranteed to contain the global maximizer.
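Each candidate evaluation reduces to \(r\) inner products, which batches naturally. A NumPy sketch of the scoring step (the actual workers use the same pattern with batched PyTorch matmuls on GPU; names are ours):

```python
import numpy as np

def score_batch(V, lam, Z):
    """Rank-r objective for many candidates at once.
    V: (n, r) top eigenvectors; lam: (r,) eigenvalues; Z: (B, n) labels.
    Returns (B,) scores sum_j lam_j |v_j^H z|^2 for each row z of Z."""
    proj = Z.conj() @ V                   # (B, r) inner products v_j^H z
    return (np.abs(proj) ** 2) @ lam

# Tiny usage example with random candidate assignments
rng = np.random.default_rng(0)
n, r, B = 12, 2, 1000
M = rng.standard_normal((n, n))
Q = M + M.T
lam, V = np.linalg.eigh(Q)
lam, V = lam[-r:], V[:, -r:]              # top-2 eigenpairs
omega = np.exp(2j * np.pi / 3)
Z = omega ** rng.integers(0, 3, size=(B, n))
best = Z[np.argmax(score_batch(V, lam, Z))]
```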

When does low-rank work?

The quality of the low-rank approximation depends on the spectral energy ratio \(\rho_r = 1 - \sum_{j=1}^r \lambda_j / \sum_{j=1}^n \lambda_j\), which measures the fraction of spectral energy not captured by the top \(r\) eigenvalues. Our theory guarantees a multiplicative approximation factor of \((1 - \epsilon)\) where \(\epsilon\) depends on \(\rho_r\) and the perturbation structure.
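This check is cheap to run before choosing a method. A small NumPy helper (ours; the name is hypothetical):

```python
import numpy as np

def spectral_energy_ratio(Q, r):
    """rho_r = 1 - (top-r eigenvalue sum) / (total eigenvalue sum).
    Small rho_r means a rank-r truncation captures most of the energy."""
    lam = np.sort(np.linalg.eigvalsh(Q))[::-1]   # decreasing order
    return 1.0 - lam[:r].sum() / lam.sum()

# A rank-2 PSD matrix has essentially zero residual energy at r = 2
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 2))
Q = X @ X.T
rho2 = spectral_energy_ratio(Q, 2)
```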

Interactive: Spectral Energy Thermometer
Bar charts show the top-10 eigenvalues of representative graphs from each family, for a selectable rank \(r\) (default \(r = 2\)). Blue bars are included at the current rank \(r\); gray bars are excluded. The energy percentage shows how much of the total spectral energy is captured.

4. Engineering for Heterogeneous Hardware

The Rank-2 algorithm generates a polynomial number of candidate solutions that must each be evaluated. Enumerating them on a single CPU core would take years, even for graphs with \(n = 1000\) nodes. But each evaluation is a dot product of length \(n\), and dot products are exactly what GPUs were built for. The challenge is not the computation per candidate; it is orchestrating millions of parallel evaluations across 15 GPUs on 4 machines connected by a campus network.

Why not Ray? Why not Triton?

Our first instinct was to use Ray for distributed task scheduling. Ray is elegant and well-documented — and it does not build on ppc64le (the PowerPC architecture running on kp001 and kp002). We also considered Triton for custom GPU kernels, but Triton requires CUDA compute capability 7.0+, and the P100 is stuck at 6.0. We were left with the most portable tool in the PyTorch ecosystem: torch.multiprocessing.

The worker design

Each machine runs a worker.py process that receives its assigned slice of the candidate space via SSH, computes the cut values on its local GPUs using batched matrix operations, and writes the best result to a shared NFS directory. The coordinator on anton.k0 launches workers, monitors output files, and aggregates results. There is no inter-worker communication during execution — pure embarrassing parallelism.
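To make the control flow concrete, here is a minimal sketch of the coordinator's aggregation step; the file naming and JSON schema are illustrative, not the repository's exact format:

```python
import json
import pathlib
import tempfile

def aggregate(results_dir):
    """Read each worker's best result from the shared (NFS) directory and
    return the global best. No inter-worker communication is needed."""
    best = {"score": float("-inf")}
    for path in sorted(pathlib.Path(results_dir).glob("worker_*.json")):
        result = json.loads(path.read_text())
        if result["score"] > best["score"]:
            best = result
    return best

# Simulate two workers dropping results into a shared directory
with tempfile.TemporaryDirectory() as d:
    for wid, score in [(0, 41.5), (1, 43.2)]:
        (pathlib.Path(d) / f"worker_{wid}.json").write_text(
            json.dumps({"worker": wid, "score": score}))
    best = aggregate(d)
```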

Throughput-proportional splitting

The naive approach is to split candidates equally: each of the 15 GPUs gets 1/15 of the work. But our GPUs are not equal. The x86 machines (anton.k0, anton.k1) achieve roughly 20% higher throughput than the PowerPC machines (kp001, kp002), partly due to faster host-device transfer and partly due to better-optimized CUDA libraries. Equal splitting means the x86 machines finish early and sit idle while the PowerPC machines grind through their oversized share.

The fix is simple: benchmark each GPU with a small calibration batch (1000 candidates, ~2 seconds) before the main run, then split work proportional to measured throughput. This one change reduced our wall-clock time by 10.3× on a 750-node instance — the difference between 53 minutes and 5.1 minutes. The interactive diagram below shows the effect.
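The splitting step itself is only a few lines. A sketch under the stated assumptions (function name and rounding policy are ours):

```python
def proportional_split(n_candidates, throughputs):
    """Assign contiguous candidate ranges proportional to each GPU's
    measured throughput (candidates/second from the calibration batch)."""
    total = sum(throughputs)
    shares = [int(n_candidates * t / total) for t in throughputs]
    shares[-1] += n_candidates - sum(shares)   # rounding remainder to last GPU
    slices, start = [], 0
    for s in shares:
        slices.append(range(start, start + s))
        start += s
    return slices

# 15 GPUs: the eight x86-hosted GPUs run ~20% faster than the seven PowerPC ones
rates = [1.2] * 8 + [1.0] * 7
slices = proportional_split(10**6, rates)
assert sum(len(s) for s in slices) == 10**6
```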

Interactive: GPU Cluster Work Distribution
Click the button to toggle between equal and throughput-proportional work distribution. Red borders indicate bottleneck machines. Progress bars show relative workload.

5. Results

We tested four methods on three graph families (3-regular, torus lattice, stochastic block model) at six sizes (\(n \in \{250, 500, 750, 1000, 1250, 1500\}\)), with three random seeds each: 54 instances in total. The methods are: Rank-2 (ours), Greedy local search, Random assignment, and the SDP relaxation with hyperplane rounding.

All Rank-2 runs used throughput-proportional splitting across 15 GPUs.

Sortable Results Table
| Family | n | Seed | Rank-2 Score | Rank-2 Time (s) | Greedy Score | Greedy Time (s) | Random Score | SDP Score | SDP Time (s) |
Click column headers to sort. Use the filter buttons to show a single graph family. Green cells indicate the highest score for each instance. Light blue cells highlight the faster method between Rank-2 and SDP.

Key findings

Torus graphs: Rank-2 matches SDP perfectly at every size tested, consistently achieving the optimal cut value. This is expected — torus Laplacians have extremely concentrated spectra (top-2 eigenvalues capture >99% of energy), so the rank-2 approximation is nearly exact. At \(n = 1008\), Rank-2 on 15 GPUs runs in 1,535s compared to SDP's 1,581s — comparable speed but with the advantage of easy parallelization to more machines.

3-Regular graphs: Rank-2 beats or matches SDP on instances with \(n \geq 1000\). At \(n = 1000\), the average Rank-2 score (7,043) exceeds the average SDP score (7,030). The gap widens at \(n = 1500\), where Rank-2 averages 10,548 versus SDP's 10,494 — a modest but consistent advantage. This is noteworthy because regular graphs are not as spectrally concentrated as torus graphs, yet the low-rank approximation still captures enough structure to be competitive.

SBM (Stochastic Block Model): Here we must be honest — Rank-2 underperforms. On SBM graphs, Greedy consistently achieves the highest cut values, followed by SDP, with Rank-2 typically 5–7% below the best. The reason is clear from the spectral energy ratios: SBM graphs have relatively flat spectra, so the top-2 eigenvalues capture only a small fraction of the total energy. Low-rank methods simply cannot see enough of the graph structure.

Interactive: Runtime Scaling (Log-Log)
Log-log plot of runtime vs. graph size, averaged across seeds. Toggle lines on/off with the checkboxes. Hover for exact values.

Timing comparison

At small sizes (\(n \leq 500\)), SDP is faster than Rank-2 because the \(\mathcal{O}(n^5/P)\) candidate-evaluation cost dominates even with 15 GPUs. But the crossover happens around \(n = 750\text{--}1000\) for regular graphs. At \(n = 1500\), Rank-2 takes about 7,100 seconds while SDP takes about 4,400 seconds on regular graphs — SDP is still faster here. However, Rank-2 scales linearly with additional GPUs: under linear scaling, roughly 10 more P100s (25 total) would bring it below SDP's time, and the algorithm can distribute across any number of machines with no communication overhead.

Scaling to modern hardware

Everything above uses 2016-era P100 GPUs — the "obsolete hardware" story. But the same code, without any modifications, also runs on modern NVIDIA H200 GPUs. This validates that the software engineering is hardware-agnostic: the same worker.py and parallel_rank_r_dir_gpu_fullgpu.py execute on GPUs spanning three generations and two CPU architectures.

The table below shows rank-2 results on a single H200 node. All experiments use the same recursive GPU solver, K=3, and 32-bit precision. The GSet benchmark instances are standard testbeds for Max-Cut algorithms; here we solve the Max-3-Cut formulation on these graphs.

| Instance | \(n\) | GPUs | Avg. batch (s) | Total time | Score (Max-3-Cut) |
|---|---|---|---|---|---|
| GSet-1 | 800 | 4 × H200 | 46.5 | 5.6 min | 41,939 |
| GSet-1 | 800 | 6 × H200 | 41.1 | 3.4 min | 41,939 |
| GSet-43 | 1,000 | 6 × H200 | 49.5 | 6.5 min | 22,965 |
| GSet-22 | 2,000 | 6 × H200 | 84.3 | 1.4 hrs | 45,791 |
| 5-regular | 3,600 | 8 × H200 | 242.0 | 18.3 hrs | 25,320 |

Rank-2 Max-3-Cut results on H200 GPUs. GSet scores are for the K=3 formulation and are not comparable to published Max-2-Cut (BQP) values, which optimize a different objective.

The standout result is the 5-regular graph with \(n=3{,}600\) on 8 H200 GPUs: 18.3 hours to enumerate all 209 billion rank-2 candidate tuples. This establishes \(n \approx 3{,}600\) as the practical frontier for exact rank-2 enumeration within a 24-hour budget on a single modern GPU node.

How do H200s compare to our P100 cluster? At \(n=1{,}000\), 6 H200s complete the rank-2 search in 392 seconds, compared to 625 seconds on 15 P100s. Per-GPU, each H200 is roughly 4× faster — consistent with its higher memory bandwidth and FP32 throughput. The P100 story is not about per-GPU competitiveness; it's about accessibility. Universities have racks of P100s sitting idle. Our algorithm turns them into a useful combinatorial optimization cluster without needing modern hardware.

A glimpse at Rank-3

The rank-3 algorithm (\(\mathcal{O}(n^7)\) candidates) is validated at small scale on H200s: \(n=50\) completes in 24 seconds, \(n=100\) in 381 seconds, both producing correct optimal scores. However, scaling is severe — extrapolating from observed rates, \(n=1{,}000\) at rank-3 would require approximately 160,000 hours, placing full rank-3 enumeration far beyond any practical budget. This motivates the randomized sampling direction: even partial enumeration of rank-3 candidates can find high-quality solutions (an early checkpoint at \(n=1{,}000\) achieved 86.6% of the rank-2 score after processing less than 1% of candidates).


6. Honest Assessment

Every method has its strengths and weaknesses. Rather than overselling our approach, we present a balanced comparison across five dimensions: solution quality, speed, parallelizability, scalability to large graphs, and theoretical guarantees. The radar chart below visualizes these trade-offs.

Interactive: Method Comparison Radar
Hover over a method name to highlight its polygon and dim others. Five axes: Quality (avg. cut ratio vs. best known), Speed (1/runtime), Parallelizability (speedup with more GPUs), Scalability (largest tractable n), Theory (approximation guarantee strength).

When to use low-rank methods

Use Rank-2 when: your graph has concentrated spectral energy (torus lattices, regular graphs, structured networks), you have access to multiple GPUs (even old ones), you need provable guarantees, and your graph is large enough (\(n \geq 500\)) that SDP's sequential nature becomes a bottleneck.

Use SDP when: you need the best known approximation ratio regardless of runtime, your graph is small (\(n \leq 500\)), or the graph has flat spectral structure (e.g., SBM, Erdős–Rényi) where low-rank methods lose information.

Use Greedy when: you need a quick, good-enough answer. Greedy is surprisingly strong on SBM graphs, often outperforming SDP. Its weakness is the lack of theoretical guarantees and poor performance on structured graphs where the optimal cut has a specific algebraic structure that Greedy's local moves cannot discover.

Limitations

The \(\mathcal{O}(n^5)\) asymptotic complexity of Rank-2 is the elephant in the room. While parallelism hides this for moderate \(n\), at \(n = 2000\) the candidate count reaches \(\sim 10^{13}\), which would require hundreds of GPUs. Rank-3 (\(\mathcal{O}(n^7)\)) is even worse. Future work on randomized candidate sampling — evaluating only a random subset of candidates — could dramatically reduce this cost while maintaining solution quality with high probability.

The second limitation is graph-family dependence. Low-rank methods are not universal: they excel on spectrally concentrated graphs and struggle on graphs with flat spectra. A practitioner must check the spectral energy ratio before deciding whether to use this approach. Fortunately, this check (computing a few top eigenvalues) is itself fast and parallelizable.


7. Code & Reproducibility

All code, data, and scripts to reproduce every experiment in this post are available on GitHub:

github.com/barbara-su/max-k-cut-parallel

Quick start

# Clone and install
git clone https://github.com/barbara-su/max-k-cut-parallel.git
cd max-k-cut-parallel
pip install -r requirements.txt

# Run Rank-2 on a single GPU (torus, n=252)
python run_rank2.py --family torus --n 252 --gpus 1

# Run distributed across multiple machines
python coordinator.py --config cluster.yaml --family regular --n 1000

Hardware requirements

Any CUDA-capable GPU with compute capability 6.0 or newer runs the code; we have tested Pascal P100s through H200s, on both x86_64 and ppc64le hosts. Distributed runs additionally require SSH access between machines and a shared NFS directory for results.

Citation

@article{Stevens2025maxkcut,
  title   = {Exploiting Low-Rank Structure in Max-$K$-Cut Problems},
  author  = {Stevens, Ria and Liao, Fangshuo and Su, Barbara and Li, Jianqiang and Kyrillidis, Anastasios},
  journal = {arXiv preprint arXiv:2602.20376},
  year    = {2025}
}