Measured · Three machines · May 2026

Throughput — head‑to‑head.

Five representative MD systems, three backends, one timing protocol — replicated on three independent machines spanning a 3× range in CPU clock and three different GPU generations. Every run measures ns/day from a single integrator.step(N) call — no per-step Python, no CPU↔GPU traffic in the timed window. The CPU scan isolates exactly where openmm‑plumed’s overhead lives.

5 systems · 3 backends · 3 machines · 0 round‑trips · GLUED up to 92× faster than openmm‑plumed

Three machines

We benchmark GLUED against openmm‑plumed on hardware you’d realistically run on, not one sweet‑spot machine. The three GPU hosts pair their cards with a many‑core server CPU, a high‑clock workstation CPU, and a 7 GHz‑boost HEDT CPU. The GPUs weaken across the set (5090 → 4090 → 3090); the CPU clocks rise.

Machine A: RTX 5090 · EPYC 7742
Server‑class · slow‑clock, many‑core CPU
  GPU: NVIDIA RTX 5090 · 32 GiB · sm_120 (Blackwell)
  CPU: 2× EPYC 7742 · 64c/128t each · 2.25 GHz max
  Memory: 528 GiB · DDR4 · 8 GiB swap
  CUDA: 12.8 (driver 570.211.01) · nvcc V12.8.93
  OpenMM: 8.4.0 · conda‑forge · CUDA 12.8 build
  PLUMED: 2.9.2 (source) · +opes +pytorch +libtorch
  PyTorch: 2.11.0 + cu128 · Blackwell sm_120 wheel
  OS: Ubuntu 24.04 · Linux 5.15 · Docker
Machine B: RTX 4090 · Core i9‑7960X
Workstation‑class · high‑clock, fewer‑core CPU
  GPU: NVIDIA RTX 4090 · 24 GiB · sm_89 (Ada Lovelace)
  CPU: Intel i9‑7960X · 16c/32t · 4.40 GHz max (2.80 base)
  Memory: 128 GiB · DDR4 · 2 GiB swap
  CUDA: 12.4 (driver 550.78) · nvcc V12.4.131
  OpenMM: 8.4.0 · source build · ABI=0
  PLUMED: 2.9.3 (source) · +all modules +libtorch
  PyTorch: 2.6.0 + cu124 · Ada sm_89 wheel
  OS: Ubuntu 22.04 · Linux 5.15 · Docker
Machine C: RTX 3090 · TR PRO 5945WX
HEDT‑class · very‑high‑boost CPU, older GPU
  GPU: NVIDIA RTX 3090 · 24 GiB · sm_86 (Ampere)
  CPU: TR PRO 5945WX · 12c/24t · 7.02 GHz boost (1.80 base)
  Memory: 131 GiB · DDR4 · L3 64 MiB
  CUDA: 12.8 (driver 570.211.01) · nvcc V12.8
  OpenMM: 8.4.0 · conda-forge · CUDA 12.8 build
  PLUMED: 2.9.2 (source) · +opes +pytorch +libtorch
  PyTorch: 2.11.0 + cu128 · Ampere sm_86 compatible
  OS: Ubuntu 24.04 · Linux 6.8 · Docker

Three CPUs spanning 2.25 → 4.40 → 7.02 GHz max clock, paired with GPUs that get weaker as we go (5090 → 4090 → 3090). If openmm‑plumed’s overhead were PCIe/sync‑bound, the GPU regression would dominate and small‑system numbers would stay flat or fall. If it’s CPU‑compute‑bound, the small‑system PLUMED numbers should rise with CPU clock, even as the GPU gets weaker. They do, at every step up in clock.

What the CPU swap tells us

Result of the three‑machine comparison

openmm‑plumed’s ceiling is the CPU, not the CPU↔GPU transfer.

On Machine A (EPYC 7742 @ 2.25 GHz), openmm‑plumed plateaus at roughly 50 ns/day in every case from 23 atoms to 168k atoms, a flat line across nearly four orders of magnitude in system size. On its own, that plateau is consistent with two different bottlenecks: per‑step CPU↔GPU transfer latency, or single‑thread PLUMED compute on the host.
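To put the plateau in per‑step terms, the sketch below converts Machine A’s openmm‑plumed throughputs (taken from the per‑case table) into wall‑clock milliseconds per step, assuming the 2 fs timestep stated in the protocol. The helper name ms_per_step is illustrative and is not part of the benchmark code.

```python
SECONDS_PER_DAY = 86_400
DT_PS = 0.002  # 2 fs timestep, as stated in the protocol


def ms_per_step(ns_per_day: float, dt_ps: float = DT_PS) -> float:
    """Wall-clock milliseconds per MD step implied by a ns/day figure."""
    steps_per_day = ns_per_day * 1000.0 / dt_ps  # ns -> ps, then ps -> steps
    return SECONDS_PER_DAY * 1000.0 / steps_per_day


# Machine A, "PLUMED + OMM" column of the per-case table (ns/day).
machine_a_plumed = {
    "ad_vacuum": 49.7,
    "ad_water": 56.6,
    "kor_a100": 49.8,
    "kor_deepcv_small": 51.1,
    "kv12_s4": 48.6,
}

step_cost_ms = {case: ms_per_step(v) for case, v in machine_a_plumed.items()}
for case, cost in step_cost_ms.items():
    print(f"{case:18s} {cost:.2f} ms/step")
```

Every case comes out between roughly 3.1 and 3.6 ms per step, whether the system has 23 atoms or 167 817, which is what a fixed per‑step host cost would look like.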

Machines B and C have the same wiring (same openmm‑plumed bridge, same PCIe sync) but progressively faster CPUs and weaker GPUs. If the bottleneck were transfer/sync, the small‑system numbers shouldn’t move (or should drop with the weaker GPU). Instead, they rise with every step up in CPU clock:

Machine A · ad_vacuum: 49.7 ns/day (EPYC 7742 @ 2.25 GHz · 5090)
Machine B · ad_vacuum: 558.8 ns/day (i9 @ 4.40 GHz · 4090), 11× Machine A
Machine C · ad_vacuum: 1 197.6 ns/day (TR PRO @ 7.02 GHz · 3090), 24× Machine A

Machine C has the slowest GPU in the comparison and still beats Machine A by 24× on the small case — because PLUMED’s bottleneck never touches the GPU. That isolates it definitively: PLUMED’s single‑thread CPU compute is the limit on small systems; the GPU sits idle waiting for the host to finish each bias step.
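As a quick check on how throughput and clock move together, the sketch below compares the clock ratio with the ad_vacuum openmm‑plumed throughput ratio for each step between hosts. All figures are the ones quoted in this section; the function name step_ratios is illustrative.

```python
# (host, max CPU clock in GHz, ad_vacuum openmm-plumed throughput in ns/day)
hosts = [
    ("A", 2.25, 49.7),
    ("B", 4.40, 558.8),
    ("C", 7.02, 1197.6),
]


def step_ratios(rows):
    """Clock ratio and throughput ratio for each consecutive pair of hosts."""
    out = []
    for (name0, clk0, thr0), (name1, clk1, thr1) in zip(rows, rows[1:]):
        out.append((f"{name0}->{name1}", clk1 / clk0, thr1 / thr0))
    return out


ratios = step_ratios(hosts)
for label, clock_ratio, throughput_ratio in ratios:
    print(f"{label}: clock x{clock_ratio:.2f}, throughput x{throughput_ratio:.2f}")

overall = hosts[-1][2] / hosts[0][2]  # Machine C relative to Machine A
print(f"A->C overall: throughput x{overall:.1f}")
```

In these numbers the throughput ratio exceeds the clock ratio at both steps, most visibly from A to B, so max clock is a useful ordering of the hosts rather than a complete model of their single‑thread speed.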

On larger systems (KOR, Kv1.2) the MD step itself dominates wall time and the gap narrows: GLUED leads PLUMED by 1.02–2.07× on Machines B and C, and by 3.0–6.1× on Machine A. On these large systems the absolute wall time is decided by GPU strength, so the 5090 host wins overall.

Headline numbers

Speedup ranges read Machine A → Machine B → Machine C. Large Machine A speedups reflect the slow‑clock CPU bottlenecking PLUMED hardest; the right‑most number is the floor where GLUED still wins on the fastest‑CPU host.

92× → 11× → 4.3×
vs openmm‑plumed · small CVs
Alanine dipeptide vacuum (OPES on dihedral). Drops with CPU clock — PLUMED was CPU‑bound here.
6.1× → 2.1× → 1.35×
vs openmm‑plumed · 71k‑atom GPCR
KOR + analytical A100 activation score. Membrane‑protein scale.
3.0× → 1.5× → 1.02×
vs openmm‑plumed · learned CV
KOR + TorchScript MLP, zero GPU round‑trips. Convergence on Machine C reflects the MLP cost dominating either path.
~1.0×
vs raw OpenMM
GLUED’s throughput relative to unbiased MD on the analytical‑CV cases beyond the 23‑atom microbenchmark (ad_water, kor_a100, kv12_s4), on all three machines.

Per‑case throughput · ns/day

Same OpenMM LangevinMiddleIntegrator, dt = 2 fs, warmup + timed window, on all three hosts. The bias wiring is the only thing that changes across rows; the host is the only thing that changes within a case.

Columns: raw OpenMM (no bias), GLUED (this work), PLUMED + openmm‑plumed. Throughput in ns/day; ratios are dimensionless.

Case / Host          raw OpenMM      GLUED   PLUMED + OMM   GLUED / raw   GLUED / PLUMED

Alanine dipeptide, vacuum · ad_vacuum · CV_DIHEDRAL · OPES_METAD · 23 atoms
  A · 5090+EPYC         6 403.4    4 589.4           49.7         0.72×            92.3×
  B · 4090+i9           9 574.2    5 915.9          558.8         0.62×            10.6×
  C · 3090+TR PRO       7 332.2    5 190.7        1 197.6         0.71×            4.33×

Alanine dipeptide, in water · ad_water · CV_DIHEDRAL · OPES_METAD · PME · 722 atoms
  A · 5090+EPYC         2 015.0    1 847.4           56.6         0.92×            32.6×
  B · 4090+i9           3 228.8    2 960.0          496.8         0.92×            5.96×
  C · 3090+TR PRO       2 687.4    2 553.6          750.6         0.95×            3.40×

κ‑opioid receptor, A100 activation score · kor_a100 · 5× distance + expression · OPES_METAD · 71 873 atoms
  A · 5090+EPYC           332.4      301.7           49.8         0.91×             6.1×
  B · 4090+i9             303.7      296.8          143.4         0.98×            2.07×
  C · 3090+TR PRO         158.2      157.9          116.7         1.00×            1.35×

κ‑opioid receptor, learned CV · kor_deepcv_small · CV_PYTORCH (10 atoms, 2k params) · OPES_EXPLORE · 71 873 atoms
  A · 5090+EPYC           314.5      155.2           51.1         0.49×             3.0×
  B · 4090+i9             303.3      137.2           90.5         0.45×            1.52×
  C · 3090+TR PRO         158.2      118.4          116.4         0.75×            1.02×

Kv1.2 ion channel, S4 voltage sensor · kv12_s4 · 7× position + expression · OPES_EXPLORE · 167 817 atoms
  A · 5090+EPYC           167.4      177.1           48.6         1.06×             3.6×
  B · 4090+i9             151.7      150.3           79.0         0.99×            1.90×
  C · 3090+TR PRO          73.5       73.8           54.4         1.00×            1.36×
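The two ratio columns follow directly from the three throughput columns. A minimal sketch that recomputes them from the table values is shown here; the dictionary layout and the function name speedups are illustrative choices, not the benchmark suite’s own data format.

```python
# (raw OpenMM, GLUED, PLUMED + openmm-plumed) in ns/day, keyed by (case, host).
throughput = {
    ("ad_vacuum", "A"): (6403.4, 4589.4, 49.7),
    ("ad_vacuum", "B"): (9574.2, 5915.9, 558.8),
    ("ad_vacuum", "C"): (7332.2, 5190.7, 1197.6),
    ("ad_water", "A"): (2015.0, 1847.4, 56.6),
    ("ad_water", "B"): (3228.8, 2960.0, 496.8),
    ("ad_water", "C"): (2687.4, 2553.6, 750.6),
    ("kor_a100", "A"): (332.4, 301.7, 49.8),
    ("kor_a100", "B"): (303.7, 296.8, 143.4),
    ("kor_a100", "C"): (158.2, 157.9, 116.7),
    ("kor_deepcv_small", "A"): (314.5, 155.2, 51.1),
    ("kor_deepcv_small", "B"): (303.3, 137.2, 90.5),
    ("kor_deepcv_small", "C"): (158.2, 118.4, 116.4),
    ("kv12_s4", "A"): (167.4, 177.1, 48.6),
    ("kv12_s4", "B"): (151.7, 150.3, 79.0),
    ("kv12_s4", "C"): (73.5, 73.8, 54.4),
}


def speedups(raw: float, glued: float, plumed: float) -> tuple[float, float]:
    """Return (GLUED / raw, GLUED / PLUMED) for one table row."""
    return glued / raw, glued / plumed


table = {key: speedups(*row) for key, row in throughput.items()}
for (case, host), (vs_raw, vs_plumed) in table.items():
    print(f"{case:18s} {host}  GLUED/raw {vs_raw:.2f}x  GLUED/PLUMED {vs_plumed:.2f}x")
```

A value of GLUED / raw below 1 means the bias costs throughput relative to unbiased MD; a value of GLUED / PLUMED above 1 means GLUED is the faster of the two biased paths.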

Per‑case visualisation

One card per system, each split into three machine blocks — Machine A, Machine B, and Machine C — so the host‑dependent and host‑independent gaps are immediately readable. Bars are log‑scaled across 10–10 000 ns/day so equal visual lengths correspond to equal × speedups, not equal Δns.
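The log‑scale property can be written down in a few lines. The sketch below maps a throughput onto a 0–1 bar length over the 10–10 000 ns/day axis described above; the function names are illustrative, and the page’s actual plotting code is not shown in the source.

```python
import math

AXIS_MIN = 10.0      # ns/day at the left edge of every bar
AXIS_MAX = 10_000.0  # ns/day at the right edge


def bar_fraction(ns_per_day: float) -> float:
    """Bar length as a fraction of the full axis on a log10 scale."""
    if not AXIS_MIN <= ns_per_day <= AXIS_MAX:
        raise ValueError("throughput outside the plotted range")
    span = math.log10(AXIS_MAX) - math.log10(AXIS_MIN)
    return (math.log10(ns_per_day) - math.log10(AXIS_MIN)) / span


def length_gap(faster: float, slower: float) -> float:
    """Difference in bar length; depends only on the ratio faster / slower."""
    return bar_fraction(faster) - bar_fraction(slower)


# The same 10x speedup gives the same visual gap anywhere on the axis.
print(length_gap(1000.0, 100.0), length_gap(100.0, 10.0))

# GLUED vs openmm-plumed on ad_vacuum, Machine A (values from the table).
print(length_gap(4589.4, 49.7))
```

On a linear axis the second example would be dominated by the 4 589.4 ns/day bar and the 49.7 ns/day bar would be nearly invisible; on the log axis the gap is simply log10 of the speedup divided by the three decades the axis spans.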

Methodology

Identical protocol across all three backends and all three machines. The only thing that changes is the bias mechanism; raw OpenMM omits the bias entirely and serves as the unbiased baseline.

Zero round-trips on the timed path

The timed phase is one integrator.step(N) call. No getState(getPositions=…), no DCD, no COLVAR, no Python in the inner loop. PyTorch CVs run their forward + autograd backward on the same CUDA stream as the integrator (torch::from_blob on the GPU position pointer, c10::cuda::CUDAStreamGuard).

Same OpenMM context across backends

GLUED and openmm‑plumed differ only in the Force attached to the System. PLUMED’s input is provided as .dat files (one per case); GLUED’s CVs are declared via force.add_* Python calls. Same integrator, same dt, same warmup, same starting positions.
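For readers unfamiliar with the PLUMED side of this wiring, here is a minimal sketch of what a per‑case input for a dihedral OPES run might contain, built as a Python string. The atom indices, PACE, BARRIER and TEMP values are hypothetical placeholders, not the benchmark’s actual .dat contents, and the helper name opes_dihedral_script is illustrative. GLUED’s force.add_* calls are not reproduced because their signatures are not given here.

```python
def opes_dihedral_script(atoms, pace=500, barrier_kj=40.0, temperature_k=300.0):
    """Build a two-line PLUMED input: one torsion CV biased with OPES_METAD.

    All numeric defaults are hypothetical; PLUMED atom indices are 1-based.
    """
    if len(atoms) != 4:
        raise ValueError("a torsion needs exactly four atom indices")
    atom_list = ",".join(str(a) for a in atoms)
    lines = [
        f"phi: TORSION ATOMS={atom_list}",
        f"opes: OPES_METAD ARG=phi PACE={pace} "
        f"BARRIER={barrier_kj} TEMP={temperature_k}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical indices for the phi dihedral of a capped alanine.
script = opes_dihedral_script((5, 7, 9, 15))
print(script)

# With openmm-plumed installed, the script is attached as a Force, e.g.:
#   from openmmplumed import PlumedForce
#   system.addForce(PlumedForce(script))
```

Whichever backend is used, the rest of the System and the integrator stay untouched, which is the property the comparison relies on.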

Warmup & sync

A 1000‑step warmup (untimed) lets NVRTC compile every CV/bias kernel and lets caches fill. A flag‑free context.getState() is issued immediately before the timer starts and again before it stops; each call flushes the CUDA queue, so the measured wall time covers every queued kernel.

benchmarks/_common.py · run_timed()
integrator.step(warmup_steps)        # untimed; NVRTC compiles kernels here
context.getState()                     # stream sync — no buffer download

t0 = time.perf_counter()
integrator.step(timed_steps)         # the one timed call — no Python inside
context.getState()                     # stream sync — waits for queued GPU work
elapsed = time.perf_counter() - t0

ns_per_day = timed_steps * dt_ps * 86_400 / (elapsed * 1000)
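The final line of the excerpt is a unit conversion, and it is easy to check in isolation. The sketch below wraps it in a standalone function with a worked example; the step count and elapsed time are made‑up inputs chosen for round numbers, not measurements.

```python
SECONDS_PER_DAY = 86_400


def ns_per_day(timed_steps: int, dt_ps: float, elapsed_s: float) -> float:
    """Simulated nanoseconds per wall-clock day for one timed window."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    simulated_ns = timed_steps * dt_ps / 1000.0  # ps -> ns
    return simulated_ns * SECONDS_PER_DAY / elapsed_s


# Hypothetical window: 50 000 steps at dt = 2 fs (0.002 ps) taking 10 s.
# That is 0.1 ns simulated in 10 s of wall time.
example = ns_per_day(timed_steps=50_000, dt_ps=0.002, elapsed_s=10.0)
print(f"{example:.1f} ns/day")
```

The example works out to about 864 ns/day: 0.1 ns in 10 s is 0.01 ns per second, multiplied by 86 400 seconds per day.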

What the numbers say

The PLUMED bottleneck is host CPU, not PCIe.

On Machine A every PLUMED case lands in 48.6–56.6 ns/day, a flat line across nearly four orders of magnitude in system size. On ad_vacuum that plateau lifts to ~560 ns/day on Machine B and ~1200 ns/day on Machine C (ad_water: ~500 and ~750 ns/day), rising with CPU max clock (2.25 → 4.40 → 7.02 GHz) even though Machine C’s GPU is the weakest. The A → B jump (11×) is larger than the 1.96× clock ratio, so clock is not the only host‑side factor, but the direction is unambiguous. Transfer/sync latency doesn’t scale with CPU speed; single‑thread PLUMED compute does. That’s the bottleneck.

GLUED’s advantage shrinks as PLUMED catches up.

GLUED’s win over openmm‑plumed is largest when PLUMED is most CPU‑bound (small systems on slow CPUs): 92× on Machine A’s ad_vacuum, 4.3× on Machine C’s ad_vacuum. As the MD step gets expensive (KOR, Kv1.2) or the CPU gets fast enough (Machine C) the gap narrows toward 1.0–1.4×. GLUED still wins everywhere — just by less.

The bias machinery is essentially free.

GLUED’s ratio against raw OpenMM is 0.91–1.06× on Machine A, 0.92–0.99× on Machine B, and 0.95–1.00× on Machine C for every analytical‑CV case (ad_water, kor_a100, kv12_s4). On Machine C’s kor_a100 GLUED hits 157.9 vs 158.2 ns/day — the chain‑rule plumbing is at the noise floor.

Learned CVs: GLUED wins until the model dominates.

On the 10‑atom TorchScript MLP, GLUED leads openmm‑plumed by 3.0× on slow‑CPU Machine A, 1.52× on Machine B, and 1.02× on fast‑CPU Machine C. The convergence on Machine C is meaningful: the MLP forward+backward is now the dominant cost on both paths — PLUMED’s host‑side staging overhead is small relative to the model itself. GLUED still keeps the tensor on‑GPU via zero‑copy torch::from_blob; you pay for the model, nothing else.

23‑atom systems are a microbenchmark, not a use case.

On gas‑phase alanine dipeptide, GLUED runs at 0.62–0.72× of raw OpenMM across all three machines — the bias kernels have a fixed launch cost that’s a real fraction of a step that does almost no MD work. Raw OMM on Machine B hits 9 574 ns/day here; this is the floor where GPU launch overhead, not chemistry, dominates. Treat ad_vacuum as a stress test of the bias plumbing, not a representative workload.

GPU strength still decides large‑system wall time.

Even though PLUMED’s plateau is CPU‑bound, the absolute throughput on big systems is set by the GPU. Raw OMM on kv12_s4 goes 167 (5090) → 152 (4090) → 73 ns/day (3090), a clean GPU‑generation curve. Machine C has the fastest CPU and lifts PLUMED off the plateau most aggressively, but its 3090 delivers a bit under half the kv12_s4 throughput of Machine A’s 5090. Pick the GPU for the system you run; the CPU clock matters when PLUMED is in the loop.

References

  1. Ibrahim, P., Wifling, D. & Clark, T. A Universal Activation Index for Class A GPCRs. J. Chem. Inf. Model. 59, 3938–3945 (2019). doi:10.1021/acs.jcim.9b00604. Defines the A100 activation index used as the kor_a100 CV.
  2. Doering, N. P. et al. Mechanistic Insights into G Protein-Biased κ-Opioid Receptor Signaling Using Dual-Charged Naltrexamine Amides. J. Med. Chem. 69, 3833–3851 (2026). doi:10.1021/acs.jmedchem.5c02135. Applies the A100 index to the KOR system benchmarked here.
  3. Bjelkmar, P., Niemelä, P. S., Vattulainen, I. & Lindahl, E. Conformational Changes and Slow Dynamics through Microsecond Polarized Atomistic Molecular Simulation of an Integral Kv1.2 Ion Channel. PLoS Comput. Biol. 5, e1000289 (2009). doi:10.1371/journal.pcbi.1000289. Source of the S4 voltage-sensor motion that the kv12_s4 CV tracks.