Benchmarks — GLUED

Three machines

We benchmark GLUED against openmm-plumed on hardware you'd realistically run on, not one sweet-spot machine: three GPU hosts built around a many-core server CPU, a high-clock workstation CPU, and a 7 GHz-boost HEDT CPU. The GPUs weaken across the set (5090 → 4090 → 3090) while the CPU max clocks rise (2.25 → 4.40 → 7.02 GHz).

Machine A

RTX 5090 · EPYC 7742

Server‑class · slow‑clock, many‑core CPU

GPU

NVIDIA RTX 5090

32 GiB · sm_120 (Blackwell)

CPU

2× EPYC 7742

64c/128t each · 2.25 GHz max

Memory

528 GiB

DDR4 · 8 GiB swap

CUDA

12.8 (driver 570.211.01)

nvcc V12.8.93

OpenMM

8.4.0

conda‑forge · CUDA 12.8 build

PLUMED

2.9.2 (source)

+opes +pytorch +libtorch

PyTorch

2.11.0 + cu128

Blackwell sm_120 wheel

OS

Ubuntu 24.04

Linux 5.15 · Docker

Machine B

RTX 4090 · Core i9‑7960X

Workstation‑class · high‑clock, fewer‑core CPU

GPU

NVIDIA RTX 4090

24 GiB · sm_89 (Ada Lovelace)

CPU

Intel i9‑7960X

16c/32t · 4.40 GHz max (2.80 base)

Memory

128 GiB

DDR4 · 2 GiB swap

CUDA

12.4 (driver 550.78)

nvcc V12.4.131

OpenMM

8.4.0

source build · ABI=0

PLUMED

2.9.3 (source)

+all modules +libtorch

PyTorch

2.6.0 + cu124

Ada sm_89 wheel

OS

Ubuntu 22.04

Linux 5.15 · Docker

Machine C

RTX 3090 · TR PRO 5945WX

HEDT‑class · very‑high‑boost CPU, older GPU

GPU

NVIDIA RTX 3090

24 GiB · sm_86 (Ampere)

CPU

TR PRO 5945WX

12c/24t · 7.02 GHz boost (1.80 base)

Memory

131 GiB

DDR4 · L3 64 MiB

CUDA

12.8 (driver 570.211.01)

nvcc V12.8

OpenMM

8.4.0

conda-forge · CUDA 12.8 build

PLUMED

2.9.2 (source)

+opes +pytorch +libtorch

PyTorch

2.11.0 + cu128

Ampere sm_86 compatible

OS

Ubuntu 24.04

Linux 6.8 · Docker

Three CPUs spanning 2.25 → 4.40 → 7.02 GHz max clock, paired with GPUs that get weaker as we go (5090 → 4090 → 3090). If openmm‑plumed’s overhead were PCIe/sync‑bound, the GPU regression would dominate and small‑system numbers would stay flat or fall. If it’s CPU‑compute‑bound, the small‑system PLUMED numbers should track CPU clock upward, even as the GPU gets weaker. They do, near‑linearly.

What the CPU swap tells us

Result of the three‑machine comparison

openmm‑plumed’s ceiling is the CPU, not the CPU↔GPU transfer.

On Machine A (EPYC 7742 @ 2.25 GHz), openmm-plumed plateaus at about 50 ns/day (48.6–56.6) in every case from 23 atoms to 168k atoms, a flat line across nearly four orders of magnitude in system size. On its own, that plateau is consistent with either of two bottlenecks: per-step CPU↔GPU transfer latency, or single-thread PLUMED compute on the host.

Machines B and C have the same wiring — same openmm‑plumed bridge, same PCIe sync — but progressively faster CPUs and weaker GPUs. If the bottleneck were transfer/sync, the small‑system numbers shouldn’t move (or should drop with the weaker GPU). Instead, they track CPU clock upward near‑linearly:

Machine A · ad_vacuum

49.7 ns/day

EPYC 7742 @ 2.25 GHz · 5090

→ 11× →

Machine B · ad_vacuum

558.8 ns/day

i9 @ 4.40 GHz · 4090

→ 2.1× (24× vs Machine A) →

Machine C · ad_vacuum

1 197.6 ns/day

TR PRO @ 7.02 GHz · 3090

Machine C has the slowest GPU in the comparison and still beats Machine A by 24× on the small case — because PLUMED’s bottleneck never touches the GPU. That isolates it definitively: PLUMED’s single‑thread CPU compute is the limit on small systems; the GPU sits idle waiting for the host to finish each bias step.
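The scaling can be checked directly from the ad_vacuum numbers. The sketch below uses only the max clocks and openmm-plumed throughputs quoted above and computes the throughput gained per GHz between successive hosts. The variable names are illustrative, and max clock is only a rough proxy for single-thread speed across different CPU architectures.

```python
# (max clock in GHz, openmm-plumed ad_vacuum throughput in ns/day), as quoted above
hosts = {
    "A": (2.25, 49.7),
    "B": (4.40, 558.8),
    "C": (7.02, 1197.6),
}

def slope(p, q):
    """ns/day gained per GHz of max clock between two hosts."""
    (clock_p, rate_p), (clock_q, rate_q) = p, q
    return (rate_q - rate_p) / (clock_q - clock_p)

slope_ab = slope(hosts["A"], hosts["B"])
slope_bc = slope(hosts["B"], hosts["C"])
speedup_ab = hosts["B"][1] / hosts["A"][1]
speedup_ac = hosts["C"][1] / hosts["A"][1]

print(f"A->B: {slope_ab:.0f} ns/day per GHz, {speedup_ab:.1f}x")
print(f"B->C: {slope_bc:.0f} ns/day per GHz")
print(f"A->C: {speedup_ac:.1f}x")
```

The two slopes agree to within a few percent (roughly 240 ns/day per GHz). "Near-linear" here therefore means a straight line with an offset, not throughput proportional to clock: the clock rises about 3.1× from A to C while throughput rises about 24×.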

On larger systems (KOR, Kv1.2) the MD step itself dominates wall time and the gap narrows: GLUED leads PLUMED by 1.02–2.07× on Machines B and C, and by 3.0–6.1× on Machine A. On these systems the absolute wall time is decided by GPU strength, so the 5090 host wins overall.

Headline numbers

Speedup ranges read Machine A → Machine B → Machine C. Large Machine A speedups reflect the slow‑clock CPU bottlenecking PLUMED hardest; the right‑most number is the floor where GLUED still wins on the fastest‑CPU host.

92× → 11× → 4.3×

vs openmm‑plumed · small CVs

Alanine dipeptide vacuum (OPES on dihedral). Drops with CPU clock — PLUMED was CPU‑bound here.

6.1× → 2.1× → 1.35×

vs openmm‑plumed · 71k‑atom GPCR

KOR + analytical A100 activation score. Membrane‑protein scale.

3.0× → 1.5× → 1.02×

vs openmm‑plumed · learned CV

KOR + TorchScript MLP, zero GPU round‑trips. Convergence on Machine C reflects the MLP cost dominating either path.

~1.0×

vs raw OpenMM

GLUED's overhead on top of unbiased MD for the analytical-CV cases ad_water, kor_a100, and kv12_s4 on all three machines. The 23-atom ad_vacuum microbenchmark is the exception at 0.62–0.72×.

Per‑case throughput · ns/day

Same OpenMM LangevinMiddleIntegrator, dt = 2 fs, warmup + timed window, on all three hosts. The bias wiring is the only thing that changes across rows; the host is the only thing that changes within a case.

Case / Host	Atoms	raw OpenMM (ns/day)	GLUED (ns/day)	PLUMED + openmm-plumed (ns/day)	GLUED / raw	GLUED / PLUMED
Alanine dipeptide, vacuum (ad_vacuum · CV_DIHEDRAL · OPES_METAD)
A · 5090+EPYC	23	6 403.4	4 589.4	49.7	0.72×	92.3×
B · 4090+i9	23	9 574.2	5 915.9	558.8	0.62×	10.6×
C · 3090+TR PRO	23	7 332.2	5 190.7	1 197.6	0.71×	4.33×
Alanine dipeptide, in water (ad_water · CV_DIHEDRAL · OPES_METAD · PME)
A · 5090+EPYC	722	2 015.0	1 847.4	56.6	0.92×	32.6×
B · 4090+i9	722	3 228.8	2 960.0	496.8	0.92×	5.96×
C · 3090+TR PRO	722	2 687.4	2 553.6	750.6	0.95×	3.40×
κ‑opioid receptor, A100 activation score (kor_a100 · 5× distance + expression · OPES_METAD)
A · 5090+EPYC	71 873	332.4	301.7	49.8	0.91×	6.1×
B · 4090+i9	71 873	303.7	296.8	143.4	0.98×	2.07×
C · 3090+TR PRO	71 873	158.2	157.9	116.7	1.00×	1.35×
κ‑opioid receptor, learned CV (kor_deepcv_small · CV_PYTORCH (10 atoms, 2k params) · OPES_EXPLORE)
A · 5090+EPYC	71 873	314.5	155.2	51.1	0.49×	3.0×
B · 4090+i9	71 873	303.3	137.2	90.5	0.45×	1.52×
C · 3090+TR PRO	71 873	158.2	118.4	116.4	0.75×	1.02×
Kv1.2 ion channel, S4 voltage sensor (kv12_s4 · 7× position + expression · OPES_EXPLORE)
A · 5090+EPYC	167 817	167.4	177.1	48.6	1.06×	3.6×
B · 4090+i9	167 817	151.7	150.3	79.0	0.99×	1.90×
C · 3090+TR PRO	167 817	73.5	73.8	54.4	1.00×	1.36×
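The two ratio columns are plain quotients of the throughput columns. As a minimal sketch, the snippet below recomputes them for the kor_a100 rows; the dictionary layout and function name are illustrative only.

```python
# ns/day from the kor_a100 rows: (raw OpenMM, GLUED, PLUMED + openmm-plumed)
kor_a100 = {
    "A": (332.4, 301.7, 49.8),
    "B": (303.7, 296.8, 143.4),
    "C": (158.2, 157.9, 116.7),
}

def ratios(raw, glued, plumed):
    """Return (GLUED / raw, GLUED / PLUMED) for one table row."""
    return glued / raw, glued / plumed

for host, row in kor_a100.items():
    vs_raw, vs_plumed = ratios(*row)
    print(f"{host}: GLUED/raw = {vs_raw:.2f}x, GLUED/PLUMED = {vs_plumed:.2f}x")
```

A GLUED / raw value below 1.0 is bias overhead; a GLUED / PLUMED value above 1.0 is the speedup over openmm-plumed on the same host.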

Per‑case visualisation

One card per system, each split into three machine blocks — Machine A, Machine B, and Machine C — so the host‑dependent and host‑independent gaps are immediately readable. Bars are log‑scaled across 10–10 000 ns/day so equal visual lengths correspond to equal × speedups, not equal Δns.
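To make the log-scale property concrete, here is a small sketch of how a bar length could be derived from a throughput on a 10–10 000 ns/day axis. The function name and the clamping behaviour are assumptions for illustration, not the page's actual plotting code.

```python
import math

LO, HI = 10.0, 10_000.0  # axis limits in ns/day, as stated above

def bar_fraction(ns_per_day):
    """Fraction of the full bar width on a log10 axis running from LO to HI."""
    clamped = min(max(ns_per_day, LO), HI)  # assumption: out-of-range values are clamped
    return (math.log10(clamped) - math.log10(LO)) / (math.log10(HI) - math.log10(LO))

# A 10x speedup covers the same visual distance anywhere on the axis.
step_low = bar_fraction(500.0) - bar_fraction(50.0)
step_high = bar_fraction(5_000.0) - bar_fraction(500.0)
print(round(step_low, 3), round(step_high, 3))
```

On this axis every factor of 10 takes one third of the bar, so a 50 → 500 ns/day gap looks exactly as long as a 500 → 5 000 ns/day gap.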

Methodology

Identical protocol across all three backends and all three machines. The only thing that changes is the bias mechanism; raw OpenMM omits the bias entirely as a floor reference.

Zero round-trips on the timed path

The timed phase is one integrator.step(N) call. No getState(getPositions=…), no DCD, no COLVAR, no Python in the inner loop. PyTorch CVs run their forward + autograd backward on the same CUDA stream as the integrator (torch::from_blob on the GPU position pointer, c10::cuda::CUDAStreamGuard).

Same OpenMM context across backends

GLUED and openmm‑plumed differ only in the Force attached to the System. PLUMED’s input is provided as .dat files (one per case); GLUED’s CVs are declared via force.add_* Python calls. Same integrator, same dt, same warmup, same starting positions.
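For readers unfamiliar with the openmm-plumed side, the sketch below shows roughly what "a .dat file attached as a Force" looks like for a single-dihedral OPES case. It is not the benchmark's actual input: the atom indices, PACE, and BARRIER values are placeholders, and the helper names are hypothetical. Only TORSION, OPES_METAD, and openmm-plumed's PlumedForce are real names.

```python
def opes_dihedral_script(atoms, pace=500, barrier=40.0):
    """Build a PLUMED input that biases one torsion with OPES_METAD.

    atoms: four 1-based atom indices defining the torsion (PLUMED counts from 1).
    pace, barrier: placeholder values, not the benchmark's settings.
    """
    a, b, c, d = atoms
    return "\n".join([
        f"phi: TORSION ATOMS={a},{b},{c},{d}",
        f"opes: OPES_METAD ARG=phi PACE={pace} BARRIER={barrier}",
    ])

def attach_plumed(system, script):
    """Add the script to an openmm.System through openmm-plumed's PlumedForce."""
    from openmmplumed import PlumedForce  # imported lazily; needs openmm-plumed installed
    return system.addForce(PlumedForce(script))

# Hypothetical indices for the alanine dipeptide phi torsion
script = opes_dihedral_script((5, 7, 9, 15))
print(script)
```

Because the bias enters as one extra Force on the System, the integrator, time step, and starting positions can stay identical between backends, which is the point of the comparison.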

Warmup & sync

A 1000-step untimed warmup lets NVRTC compile every CV/bias kernel and lets caches fill. A flag-free context.getState() is called immediately before the timer starts and again before it stops; it flushes the CUDA queue so the wall time captures every queued kernel.

benchmarks/_common.py · run_timed()

import time

integrator.step(warmup_steps)    # untimed; NVRTC compiles kernels here
context.getState()               # stream sync, no buffer download

t0 = time.perf_counter()
integrator.step(timed_steps)     # the one timed call, no Python inside
context.getState()               # stream sync, waits for queued GPU work
elapsed = time.perf_counter() - t0

# simulated ps -> ns (/ 1000), wall seconds -> days (* 86_400)
ns_per_day = timed_steps * dt_ps * 86_400 / (elapsed * 1000)
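The unit conversion in the last line is easy to get wrong, so here it is isolated as a standalone helper with a worked value. The function signature and the step count and elapsed time in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def ns_per_day(timed_steps, dt_ps, elapsed_s):
    """Convert a timed window into simulated nanoseconds per wall-clock day."""
    simulated_ns = timed_steps * dt_ps / 1000.0   # ps -> ns
    return simulated_ns * 86_400 / elapsed_s      # per second -> per day

# Hypothetical window: 50 000 steps at dt = 2 fs (0.002 ps) finishing in 17.28 s
rate = ns_per_day(50_000, 0.002, 17.28)
print(f"{rate:.1f} ns/day")
```

Here 50 000 steps of 2 fs simulate 0.1 ns; doing that in 17.28 s of wall time works out to 500 ns/day.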

What the numbers say

The PLUMED bottleneck is host CPU, not PCIe.

On Machine A every PLUMED case lands in 48.6–56.6 ns/day, a flat line across nearly four orders of magnitude in system size. On ad_vacuum that plateau lifts to ~560 ns/day on Machine B and ~1200 ns/day on Machine C (ad_water: ~500 and ~750), tracking CPU max clock (2.25 → 4.40 → 7.02 GHz) almost linearly at roughly 240 ns/day per GHz, even though Machine C's GPU is the weakest. Transfer/sync latency doesn't scale with CPU clock; single-thread PLUMED compute does. That's the bottleneck.
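A throughput plateau is easier to interpret as a per-step wall time. The sketch below converts the openmm-plumed ad_vacuum numbers at dt = 2 fs; the function name is illustrative.

```python
DT_FS = 2.0  # time step used in every benchmark row

def ms_per_step(ns_per_day, dt_fs=DT_FS):
    """Wall-clock milliseconds spent on one MD step at a given throughput."""
    steps_per_day = ns_per_day * 1e6 / dt_fs   # 1 ns = 1e6 fs
    return 86_400 * 1000.0 / steps_per_day

# openmm-plumed on ad_vacuum, ns/day per host (from the table)
plumed_ad_vacuum = {"A": 49.7, "B": 558.8, "C": 1197.6}
for host, rate in plumed_ad_vacuum.items():
    print(f"Machine {host}: {ms_per_step(rate):.2f} ms/step")
```

Machine A's ~50 ns/day corresponds to about 3.5 ms per step regardless of system size, falling to roughly 0.3 ms on Machine B and 0.14 ms on Machine C. A constant per-step cost that shrinks as the host CPU gets faster is what a host-side bottleneck looks like.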

GLUED’s advantage shrinks as PLUMED catches up.

GLUED’s win over openmm‑plumed is largest when PLUMED is most CPU‑bound (small systems on slow CPUs): 92× on Machine A’s ad_vacuum, 4.3× on Machine C’s ad_vacuum. As the MD step gets expensive (KOR, Kv1.2) or the CPU gets fast enough (Machine C) the gap narrows toward 1.0–1.4×. GLUED still wins everywhere — just by less.

The bias machinery is essentially free.

GLUED's ratio against raw OpenMM is 0.91–1.06× on Machine A, 0.92–0.99× on Machine B, and 0.95–1.00× on Machine C for the analytical-CV cases other than the 23-atom microbenchmark (ad_water, kor_a100, kv12_s4). On Machine C's kor_a100 GLUED hits 157.9 vs 158.2 ns/day, so the chain-rule plumbing is at the noise floor.

Learned CVs: GLUED wins until the model dominates.

On the 10‑atom TorchScript MLP, GLUED leads openmm‑plumed by 3.0× on slow‑CPU Machine A, 1.52× on Machine B, and 1.02× on fast‑CPU Machine C. The convergence on Machine C is meaningful: the MLP forward+backward is now the dominant cost on both paths — PLUMED’s host‑side staging overhead is small relative to the model itself. GLUED still keeps the tensor on‑GPU via zero‑copy torch::from_blob; you pay for the model, nothing else.

23‑atom systems are a microbenchmark, not a use case.

On gas‑phase alanine dipeptide, GLUED runs at 0.62–0.72× of raw OpenMM across all three machines — the bias kernels have a fixed launch cost that’s a real fraction of a step that does almost no MD work. Raw OMM on Machine B hits 9 574 ns/day here; this is the floor where GPU launch overhead, not chemistry, dominates. Treat ad_vacuum as a stress test of the bias plumbing, not a representative workload.
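The fixed-cost reading can be sanity-checked by converting the ad_vacuum throughputs to per-step wall times and subtracting. This is a back-of-the-envelope sketch from the table values, assuming the whole difference is bias overhead; the names are illustrative.

```python
DT_FS = 2.0  # time step used in every benchmark row

def us_per_step(ns_per_day, dt_fs=DT_FS):
    """Wall-clock microseconds per MD step at a given throughput."""
    steps_per_day = ns_per_day * 1e6 / dt_fs   # 1 ns = 1e6 fs
    return 86_400 * 1e6 / steps_per_day

# ad_vacuum throughput in ns/day: (raw OpenMM, GLUED), from the table
ad_vacuum = {"A": (6403.4, 4589.4), "B": (9574.2, 5915.9), "C": (7332.2, 5190.7)}

overhead_us = {
    host: us_per_step(glued) - us_per_step(raw)
    for host, (raw, glued) in ad_vacuum.items()
}
for host, extra in overhead_us.items():
    print(f"Machine {host}: +{extra:.1f} us per step with the bias attached")
```

The extra cost comes out at roughly 10 to 11 µs per step on all three hosts. That is small in absolute terms, but a raw step here takes only about 18 to 27 µs, which is why the ratio drops to 0.62–0.72×.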

GPU strength still decides large‑system wall time.

Even though PLUMED's plateau is CPU-bound, the absolute throughput on big systems is set by the GPU. Raw OMM on kv12_s4 goes 167 (5090) → 152 (4090) → 73 ns/day (3090), a clean GPU-generation curve. Machine C has the fastest CPU and lifts PLUMED off the plateau most aggressively, but its 3090 delivers less than half the kv12_s4 throughput of Machine A's 5090. Pick the GPU for the system you run; the CPU clock matters when PLUMED is in the loop.

References

Ibrahim, P., Wifling, D. & Clark, T. A Universal Activation Index for Class A GPCRs. J. Chem. Inf. Model. 59, 3938–3945 (2019). doi:10.1021/acs.jcim.9b00604. Defines the A100 activation index used as the kor_a100 CV.
Doering, N. P. et al. Mechanistic Insights into G Protein-Biased κ-Opioid Receptor Signaling Using Dual-Charged Naltrexamine Amides. J. Med. Chem. 69, 3833–3851 (2026). doi:10.1021/acs.jmedchem.5c02135. Applies the A100 index to the KOR system benchmarked here.
Bjelkmar, P., Niemelä, P. S., Vattulainen, I. & Lindahl, E. Conformational Changes and Slow Dynamics through Microsecond Polarized Atomistic Molecular Simulation of an Integral Kv1.2 Ion Channel. PLoS Comput. Biol. 5, e1000289 (2009). doi:10.1371/journal.pcbi.1000289. Source of the S4 voltage-sensor motion that the kv12_s4 CV tracks.
