Throughput — head‑to‑head.
Five representative MD systems, three backends, one timing protocol — replicated on three independent machines spanning a 3× range in CPU clock and three different GPU generations. Every run measures ns/day from a single integrator.step(N) call — no per-step Python, no CPU↔GPU traffic in the timed window. The CPU scan isolates exactly where openmm‑plumed’s overhead lives.
Three machines
We benchmark GLUED against openmm‑plumed on hardware you’d realistically run on, not one sweet‑spot machine. The three GPU hosts pair a many‑core server CPU, a high‑clock workstation CPU, and a 7 GHz‑boost HEDT CPU with progressively weaker GPUs (5090 → 4090 → 3090), so the CPUs strengthen as the GPUs weaken.
Three CPUs spanning 2.25 → 4.40 → 7.02 GHz max clock, paired with GPUs that get weaker as we go (5090 → 4090 → 3090). If openmm‑plumed’s overhead were PCIe/sync‑bound, the weaker GPUs would dominate and small‑system numbers would stay flat or fall. If it is CPU‑compute‑bound, the small‑system PLUMED numbers should rise with CPU clock even as the GPU gets weaker. They do: on ad_vacuum, openmm‑plumed climbs from 49.7 to 558.8 to 1 197.6 ns/day across Machines A, B, and C, a larger gain than the clock ratio alone.
What the CPU swap tells us
openmm‑plumed’s ceiling is the CPU, not the CPU↔GPU transfer.
On Machine A (EPYC 7742 @ 2.25 GHz), openmm‑plumed plateaus at ~50 ns/day in every case from 23 atoms to 168k atoms, a flat line across nearly four orders of magnitude in system size. That alone is consistent with two different bottlenecks: per‑step CPU↔GPU transfer latency, or single‑thread PLUMED compute on the host.
Machines B and C have the same wiring (same openmm‑plumed bridge, same PCIe sync) but progressively faster CPUs and weaker GPUs. If the bottleneck were transfer/sync, the small‑system numbers shouldn’t move (or should drop with the weaker GPU). Instead, they rise with CPU clock: on ad_vacuum, openmm‑plumed goes from 49.7 ns/day on Machine A to 558.8 on Machine B and 1 197.6 on Machine C.
Machine C has the slowest GPU in the comparison and still beats Machine A by 24× on openmm‑plumed ad_vacuum (1 197.6 vs 49.7 ns/day), because PLUMED’s bottleneck never touches the GPU. That points to PLUMED’s single‑thread CPU compute as the limit on small systems: the GPU sits idle waiting for the host to finish each bias step.
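As a quick check of the CPU‑swap argument, the sketch below compares each host's max‑clock ratio with its openmm‑plumed throughput ratio on the two alanine dipeptide cases. All numbers are copied from this page; the `ratio_vs` helper and the dictionary layout are illustrative names, not part of the benchmark code.

```python
# Max CPU clock (GHz) and openmm-plumed throughput (ns/day), copied from this page.
clock_ghz = {"A": 2.25, "B": 4.40, "C": 7.02}
plumed_ns_day = {
    "ad_vacuum": {"A": 49.7, "B": 558.8, "C": 1197.6},
    "ad_water": {"A": 56.6, "B": 496.8, "C": 750.6},
}


def ratio_vs(base, values):
    """Return each host's value divided by the value on host `base`."""
    return {host: v / values[base] for host, v in values.items()}


clock_ratio = ratio_vs("A", clock_ghz)
for case, ns_day in plumed_ns_day.items():
    gain = ratio_vs("A", ns_day)
    for host in ("B", "C"):
        print(f"{case} {host}: clock x{clock_ratio[host]:.2f}, "
              f"PLUMED x{gain[host]:.1f}")
```

Both throughput ratios come out larger than the corresponding clock ratios.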
On larger systems (KOR, Kv1.2) the MD step itself dominates wall time and the gap narrows: GLUED leads PLUMED by 1.02–2.07× on Machines B and C, and 3.0–6.1× on Machine A. On these large systems the absolute wall time is decided by GPU strength, so the 5090 host wins overall.
Headline numbers
Speedup ranges read Machine A → Machine B → Machine C. Large Machine A speedups reflect the slow‑clock CPU bottlenecking PLUMED hardest; the right‑most number is the floor where GLUED still wins on the fastest‑CPU host.
Per‑case throughput · ns/day
Same OpenMM LangevinMiddleIntegrator, dt = 2 fs, warmup + timed window, on all three hosts. The bias wiring is the only thing that changes across rows; the host is the only thing that changes within a case.
| Case / Host | Atoms | raw OpenMM (ns/day) | GLUED (ns/day) | PLUMED + OMM (ns/day) | GLUED / raw | GLUED / PLUMED |
|---|---|---|---|---|---|---|
| Alanine dipeptide, vacuum · ad_vacuum · CV_DIHEDRAL · OPES_METAD | 23 | | | | | |
| A · 5090+EPYC | — | 6 403.4 | 4 589.4 | 49.7 | 0.72× | 92.3× |
| B · 4090+i9 | — | 9 574.2 | 5 915.9 | 558.8 | 0.62× | 10.6× |
| C · 3090+TR PRO | — | 7 332.2 | 5 190.7 | 1 197.6 | 0.71× | 4.33× |
| Alanine dipeptide, in water · ad_water · CV_DIHEDRAL · OPES_METAD · PME | 722 | | | | | |
| A · 5090+EPYC | — | 2 015.0 | 1 847.4 | 56.6 | 0.92× | 32.6× |
| B · 4090+i9 | — | 3 228.8 | 2 960.0 | 496.8 | 0.92× | 5.96× |
| C · 3090+TR PRO | — | 2 687.4 | 2 553.6 | 750.6 | 0.95× | 3.40× |
| κ‑opioid receptor, A100 activation score · kor_a100 · 5× distance + expression · OPES_METAD | 71 873 | | | | | |
| A · 5090+EPYC | — | 332.4 | 301.7 | 49.8 | 0.91× | 6.1× |
| B · 4090+i9 | — | 303.7 | 296.8 | 143.4 | 0.98× | 2.07× |
| C · 3090+TR PRO | — | 158.2 | 157.9 | 116.7 | 1.00× | 1.35× |
| κ‑opioid receptor, learned CV · kor_deepcv_small · CV_PYTORCH (10 atoms, 2k params) · OPES_EXPLORE | 71 873 | | | | | |
| A · 5090+EPYC | — | 314.5 | 155.2 | 51.1 | 0.49× | 3.0× |
| B · 4090+i9 | — | 303.3 | 137.2 | 90.5 | 0.45× | 1.52× |
| C · 3090+TR PRO | — | 158.2 | 118.4 | 116.4 | 0.75× | 1.02× |
| Kv1.2 ion channel, S4 voltage sensor · kv12_s4 · 7× position + expression · OPES_EXPLORE | 167 817 | | | | | |
| A · 5090+EPYC | — | 167.4 | 177.1 | 48.6 | 1.06× | 3.6× |
| B · 4090+i9 | — | 151.7 | 150.3 | 79.0 | 0.99× | 1.90× |
| C · 3090+TR PRO | — | 73.5 | 73.8 | 54.4 | 1.00× | 1.36× |
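The two ratio columns follow directly from the three throughput columns. A minimal sketch that recomputes them for a few rows, with values copied from the table (`rows` and `ratios` are hypothetical names used only for this illustration):

```python
# (raw OpenMM, GLUED, PLUMED + OMM) in ns/day, copied from the table.
rows = {
    ("ad_vacuum", "A"): (6403.4, 4589.4, 49.7),
    ("kor_a100", "C"): (158.2, 157.9, 116.7),
    ("kv12_s4", "A"): (167.4, 177.1, 48.6),
}


def ratios(raw, glued, plumed):
    """Return (GLUED / raw, GLUED / PLUMED) for one table row."""
    return glued / raw, glued / plumed


for (case, host), values in rows.items():
    vs_raw, vs_plumed = ratios(*values)
    print(f"{case} on {host}: {vs_raw:.2f}x raw, {vs_plumed:.2f}x PLUMED")
```

A GLUED / raw value below 1 means the bias costs throughput; a value slightly above 1, as on kv12_s4 for Machine A, can only come from run‑to‑run variation, since raw OpenMM does strictly less work.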
Per‑case visualisation
One card per system, each split into three machine blocks — Machine A, Machine B, and Machine C — so the host‑dependent and host‑independent gaps are immediately readable. Bars are log‑scaled across 10–10 000 ns/day so equal visual lengths correspond to equal × speedups, not equal Δns.
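To see why a log axis makes equal bar differences mean equal speedups, here is a small sketch of the mapping from ns/day to bar length. The 10–10 000 ns/day limits come from the text; `bar_fraction` and the sample values are made up for illustration and are not the page's plotting code.

```python
import math

LO, HI = 10.0, 10_000.0  # axis limits in ns/day, as stated in the text


def bar_fraction(ns_day):
    """Fraction of the full bar width for a value on a log10 axis."""
    span = math.log10(HI) - math.log10(LO)
    return (math.log10(ns_day) - math.log10(LO)) / span


# Two pairs with the same 10x ratio but very different absolute gaps.
small_gap = bar_fraction(500.0) - bar_fraction(50.0)      # delta = 450 ns/day
large_gap = bar_fraction(5000.0) - bar_fraction(500.0)    # delta = 4500 ns/day
print(round(small_gap, 3), round(large_gap, 3))
```

Both gaps cover one third of the bar, because each decade of throughput takes the same width on a three‑decade axis.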
Methodology
Identical protocol across all three backends and all three machines. The only thing that changes is the bias mechanism; raw OpenMM omits the bias entirely and serves as the no‑bias baseline.
Zero round-trips on the timed path
The timed phase is one integrator.step(N) call. No getState(getPositions=…), no
DCD, no COLVAR, no Python in the inner loop. PyTorch CVs run their forward + autograd backward on the
same CUDA stream as the integrator (torch::from_blob on the GPU position pointer,
c10::cuda::CUDAStreamGuard).
Same OpenMM context across backends
GLUED and openmm‑plumed differ only in the Force attached to the System. PLUMED’s input
is provided as .dat files (one per case); GLUED’s CVs are declared via
force.add_* Python calls. Same integrator, same dt, same warmup, same starting positions.
Warmup & sync
A 1000‑step warmup (untimed) lets NVRTC compile every CV/bias kernel and lets caches fill. Immediately
before the timer starts and again before it stops, a flag‑free context.getState() flushes the CUDA queue
so wall time captures every queued kernel.
    integrator.step(warmup_steps)   # untimed; NVRTC compiles kernels here
    context.getState()              # stream sync, no buffer download
    t0 = time.perf_counter()
    integrator.step(timed_steps)    # the one timed call, no Python inside
    context.getState()              # stream sync, waits for queued GPU work
    elapsed = time.perf_counter() - t0
    ns_per_day = timed_steps * dt_ps * 86_400 / (elapsed * 1000)
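The same protocol can be wrapped in a small backend‑agnostic helper. This is a sketch under stated assumptions, not the benchmark's actual harness: `benchmark`, `step`, and `sync` are hypothetical names, where `step` stands in for `integrator.step` and `sync` for the flag‑free `context.getState()` call.

```python
import time


def ns_per_day(timed_steps, dt_ps, elapsed_s):
    """Convert a timed window into ns/day (same arithmetic as the protocol above)."""
    simulated_ns = timed_steps * dt_ps / 1000.0   # ps -> ns
    return simulated_ns * 86_400 / elapsed_s      # 86 400 s per day


def benchmark(step, sync, warmup_steps, timed_steps, dt_ps):
    """Warm up, sync, time one step(N) call, sync again, and return ns/day."""
    step(warmup_steps)    # untimed warmup
    sync()                # drain queued work before starting the clock
    t0 = time.perf_counter()
    step(timed_steps)     # the single timed call
    sync()                # wait for queued work before stopping the clock
    return ns_per_day(timed_steps, dt_ps, time.perf_counter() - t0)


# Worked example: 50 000 steps at dt = 2 fs (0.002 ps) finishing in 10 s.
print(ns_per_day(50_000, 0.002, 10.0))
```

In the worked example, 50 000 steps of 2 fs are 0.1 ns of simulated time, so finishing in 10 s corresponds to roughly 864 ns/day. With OpenMM one would pass `integrator.step` and `context.getState` as the two callables.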
What the numbers say
The PLUMED bottleneck is host CPU, not PCIe.
On Machine A every PLUMED case lands in 48.6–56.6 ns/day, a flat line across nearly four orders of magnitude in system size. On ad_vacuum that plateau lifts to ~560 ns/day on Machine B and ~1200 ns/day on Machine C (ad_water: ~500 and ~750), rising with CPU max clock (2.25 → 4.40 → 7.02 GHz) even though Machine C’s GPU is the weakest. The rise is steeper than the clock ratio alone: 1.6× more clock from B to C buys 1.5–2.1×, while roughly 2× more clock from A to B buys 9–11×. Transfer/sync latency doesn’t scale with CPU clock; single‑thread PLUMED compute does. That’s the bottleneck.
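It can help to read the plateau as wall‑clock time per MD step. The back‑of‑envelope sketch below converts ns/day into microseconds per step, assuming the dt = 2 fs stated in the protocol; `step_time_us` is a hypothetical helper name and the inputs are the openmm‑plumed ad_vacuum figures from the per‑case table.

```python
def step_time_us(ns_day, dt_fs=2.0):
    """Wall-clock microseconds per MD step implied by a ns/day figure."""
    steps_per_day = ns_day * 1e6 / dt_fs      # 1 ns = 1e6 fs
    return 86_400 * 1e6 / steps_per_day       # 86 400 s per day, in microseconds


# openmm-plumed on ad_vacuum, ns/day taken from the per-case table.
for host, ns_day in {"A": 49.7, "B": 558.8, "C": 1197.6}.items():
    print(f"Machine {host}: {step_time_us(ns_day):.0f} us per step")
```

By this estimate the Machine A plateau corresponds to about 3.5 ms per step, falling to about 0.31 ms on Machine B and 0.14 ms on Machine C.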
GLUED’s advantage shrinks as PLUMED catches up.
GLUED’s win over openmm‑plumed is largest when PLUMED is most CPU‑bound (small systems on slow CPUs): 92× on Machine A’s ad_vacuum, 4.3× on Machine C’s ad_vacuum. As the MD step gets expensive (KOR, Kv1.2) or the CPU gets fast enough (Machine C) the gap narrows toward 1.0–1.4×. GLUED stays ahead in every case measured, but on Machine C’s kor_deepcv_small the margin is only 1.02×.
The bias machinery is essentially free.
GLUED’s ratio against raw OpenMM is 0.91–1.06× on Machine A, 0.92–0.99× on Machine B, and 0.95–1.00× on Machine C for the analytical‑CV cases above microbenchmark size (ad_water, kor_a100, kv12_s4). On Machine C’s kor_a100 GLUED hits 157.9 vs 158.2 ns/day: the chain‑rule plumbing is at the noise floor.
Learned CVs: GLUED wins until the model dominates.
On the 10‑atom TorchScript MLP, GLUED leads openmm‑plumed by 3.0× on
slow‑CPU Machine A, 1.52× on Machine B, and 1.02× on
fast‑CPU Machine C. The convergence on Machine C is meaningful: the MLP forward+backward is now
the dominant cost on both paths — PLUMED’s host‑side staging overhead is small
relative to the model itself. GLUED still keeps the tensor on‑GPU via zero‑copy
torch::from_blob; you pay for the model, nothing else.
23‑atom systems are a microbenchmark, not a use case.
On gas‑phase alanine dipeptide, GLUED runs at 0.62–0.72× of raw OpenMM across all three machines — the bias kernels have a fixed launch cost that’s a real fraction of a step that does almost no MD work. Raw OMM on Machine B hits 9 574 ns/day here; this is the floor where GPU launch overhead, not chemistry, dominates. Treat ad_vacuum as a stress test of the bias plumbing, not a representative workload.
GPU strength still decides large‑system wall time.
Even though PLUMED’s plateau is CPU‑bound, the absolute throughput on big systems is set by the GPU. Raw OMM on kv12_s4 goes 167 (5090) → 152 (4090) → 73 ns/day (3090), a clean GPU‑generation curve. Machine C has the fastest CPU and lifts PLUMED off the plateau most aggressively, but its 3090 delivers less than half the kv12_s4 throughput of Machine A’s 5090 (73.5 vs 167.4 ns/day). Pick the GPU for the system you run; the CPU clock matters when PLUMED is in the loop.
References
- Ibrahim, P., Wifling, D. & Clark, T. A Universal Activation Index for Class A GPCRs. J. Chem. Inf. Model. 59, 3938–3945 (2019). doi:10.1021/acs.jcim.9b00604. Defines the A100 activation index used as the kor_a100 CV.
- Doering, N. P. et al. Mechanistic Insights into G Protein-Biased κ-Opioid Receptor Signaling Using Dual-Charged Naltrexamine Amides. J. Med. Chem. 69, 3833–3851 (2026). doi:10.1021/acs.jmedchem.5c02135. Applies the A100 index to the KOR system benchmarked here.
- Bjelkmar, P., Niemelä, P. S., Vattulainen, I. & Lindahl, E. Conformational Changes and Slow Dynamics through Microsecond Polarized Atomistic Molecular Simulation of an Integral Kv1.2 Ion Channel. PLoS Comput. Biol. 5, e1000289 (2009). doi:10.1371/journal.pcbi.1000289. Source of the S4 voltage-sensor motion that the kv12_s4 CV tracks.