Can PIM Accelerate Molecular Dynamics?

20 min read
Arch4Chem Molecular Dynamics Blog

中文版:分子动力学可以用 PIM 优化的可行性

TL;DR

The question this article answers: Can PIM deliver a meaningful speedup to the main timestep loop of single-GPU all-atom MD?

Short answer: No. Modern single-GPU MD engines (GROMACS / NAMD / AMBER) have already squeezed the dominant kernels hard. On a single GPU, the genuinely memory-bound work that PIM could help with is only about 10–25% of total runtime. Plug that into Amdahl: even with a 10× speedup on that slice, the overall gain is only about 1.1–1.3×—not an order of magnitude.

But step outside the timestep loop: trajectory analysis, neighbour-list preprocessing, ensemble workflows—these data-movement-heavy stages are much friendlier to PIM / in-storage processing. That’s where PIM × MD is actually worth researching.

A few terms

MD / force field / simulation

  • Docking: Fast prediction of how a small molecule fits into a protein binding pocket. More static and approximate than MD.
  • Trajectory: The atomic motion sequence produced by MD—a series of molecular structures over time.
  • Neighbour list: A data structure each MD engine maintains, recording for each atom all the “neighbour” atoms within the cutoff distance. Lets short-range force computation drop from naïve O(N2)O(N^2) pair sum to O(N)O(N) (each atom only looks at ~100 neighbours). Common implementations: Verlet list, cell list. Access pattern is irregular—this is the physical origin of the “irregular memory access” kernel in Table 2 (§3.2). See §2.2 for details.
  • PME (Particle Mesh Ewald): The standard method for computing long-range Coulomb sums under periodic boundary conditions. Not a separate interaction—just how Coulomb is actually evaluated. Involves a grid and 3D FFT.
  • Timestep loop: MD’s main loop—each iteration computes forces and updates atom positions.
  • Conformation / conformational change: Different 3D shapes a molecule can adopt by rotating single bonds (without breaking any). “Conformational change” specifically refers to how a protein’s shape adjusts when a drug binds—the key phenomenon MD captures but docking cannot (induced fit, cryptic pockets, allosteric effects all belong here).
  • AIMD (Ab Initio MD) / QM/MM: MD variants that compute forces using quantum mechanics. AIMD uses QM throughout; QM/MM applies QM to a small region of interest (e.g., a reaction centre) and classical force fields elsewhere. Several orders of magnitude more expensive than classical MD.

Drug discovery pipeline

  • Hit discovery / hit-to-lead: Two phases of drug discovery. Hit discovery is the initial screening stage producing active candidates (hundreds to thousands); hit-to-lead is the phase of optimising, validating, and narrowing hits down to a few lead compounds.
  • FEP (Free Energy Perturbation) / MM-PBSA: Two MD-based methods for accurately computing binding free energy. FEP uses an alchemical pathway—most accurate but costs several GPU·days per molecule; MM-PBSA uses endpoint conformations with implicit solvent—cheaper but less precise.
  • RMSD / RMSF: The two most common metrics in trajectory analysis. RMSD (Root Mean Square Deviation) measures overall deviation from a reference structure; RMSF (Root Mean Square Fluctuation) measures per-atom temporal variability.

Hardware / performance

  • PIM (Processing-in-Memory): An architecture placing compute units near or inside memory cells to reduce data movement. Suited to workloads with low arithmetic intensity and heavy memory access.
  • Memory-bound / compute-bound: Whether the performance bottleneck is data movement (bandwidth, latency) or arithmetic. PIM mainly helps the former.
  • HBM (High Bandwidth Memory): High-bandwidth on-package memory used in modern GPUs, at TB/s scale—the hardware reason modern MD engines can move so much data without choking.
  • Strong scaling: How well performance scales when you fix the problem size and add more compute. Strong scaling of a single MD trajectory is bottlenecked by timestep dependency—synchronisation, not raw compute.

1. What does Molecular Dynamics do?

Molecular Dynamics (MD) computes atomic motion frame by frame using Newtonian mechanics. Put simply, MD answers:

Given all atom positions right now, where will they be a small slice of time from now?

In drug design, MD plays the role of the “molecular movie cinematographer”. The upstream Virtual Screening (VS)—using methods like docking, similarity search (by molecular fingerprint similarity), ML scoring / docking, etc.—filters compound libraries from billions down to a few hundred to a few thousand hits. But VS only gives you a static snapshot: “this small molecule probably fits into the binding pocket.”

In reality, proteins breathe, water molecules dart in and out, side chains swing. So after getting hits, MD takes over: simulate atomic motion frame by frame under Newtonian mechanics for tens to hundreds of nanoseconds, checking whether the molecule actually stays put in water—whether key hydrogen bonds break, how the water molecules in the pocket arrange themselves, and whether the protein undergoes conformational change (see glossary) in response to binding. Only if the “movie” shows the molecule doesn’t escape the pocket do you spend on the more expensive computation downstream (FEP / MM-PBSA).

MD’s usefulness extends well beyond drug discovery: protein folding, membrane proteins, enzyme kinetics, materials science, biomolecular mechanism studies all rely on it. These share the same underlying engine and compute structure, so accelerating MD effectively accelerates an entire downstream workflow.


2. What does MD compute?

Classical MD is built on Newtonian mechanics. For each atom ii:

Fi=miai=riU(r)(1)F_i = m_i a_i = -\nabla_{r_i} U(\mathbf{r}) \tag{1}

where U(r)U(\mathbf{r}) is the total potential energy of the system, written as a function of all atomic coordinates r\mathbf{r}. MD’s core loop: compute the force on each atom, update positions and velocities, advance to the next timestep, repeat.

A simplified MD loop:

Initial atomic structure → Assign force-field parameters → Enter the timestep loop:
    Build / update neighbour list
    Compute bonded forces
    Compute short-range non-bonded forces (LJ + short-range Coulomb)
    Compute long-range electrostatics (PME)
    Apply constraints
    Update positions and velocities
    Occasionally write trajectory
→ Output trajectory

Almost all the runtime cost of this loop goes into force computation—into the specific form of U(r)U(\mathbf{r}). This U(r)U(\mathbf{r}) is called the force field, and it’s the heart of MD. We unpack it below.

2.1 Overall structure of the force field

The force field decomposes the total potential energy into two main categories:

U(r)=UNon-Bonded(r)+UBonded(r)(2)U(\mathbf{r}) = U_{\text{Non-Bonded}}(\mathbf{r}) + U_{\text{Bonded}}(\mathbf{r}) \tag{2}
  • Non-bonded UNon-BondedU_{\text{Non-Bonded}}: interactions you feel just by being close in space—electrostatics, van der Waals
  • Bonded UBondedU_{\text{Bonded}}: geometric deformations between atoms connected by covalent bonds—bond length, bond angle, dihedral

These two categories have very different compute characteristics, which is the foundation of the bottleneck analysis later. We unpack them separately below.

2.2 Non-bonded: the main compute cost in MD

Non-bonded interactions are typically modelled as a combination of electrostatic (Coulomb) and van der Waals (Lennard-Jones) potentials:

UNon-Bonded(r)=ijqiqj4πϵ0rijCoulomb+ij4ϵij[(σijrij)12(σijrij)6]Lennard-Jones(3)U_{\text{Non-Bonded}}(\mathbf{r}) = \underbrace{\sum_{ij} \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}}}_{\text{Coulomb}} + \underbrace{\sum_{ij} 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]}_{\text{Lennard-Jones}} \tag{3}
  • Coulomb term: depends on inter-atomic distance rijr_{ij}, vacuum permittivity ϵ0\epsilon_0, and partial charges qiq_i, qjq_j
  • LJ term: 1/r61/r^6 captures long-range attractive dispersion (van der Waals); 1/r121/r^{12} captures short-range repulsion from overlapping electron clouds

Why is non-bonded expensive? Because pair count grows fast with system size. A million-atom system has 5×1011\approx 5 \times 10^{11} pair interactions—and a microsecond MD simulation runs 5×1085 \times 10^8 timesteps, each step recomputing forces. So mainstream MD engines don’t actually run O(N2)O(N^2) pair sum. Instead they use neighbour lists + cutoff to bring the short-range part down to O(N)O(N), and PME (Particle Mesh Ewald) to bring the long-range part down to O(NlogN)O(N \log N).

On neighbour lists: A neighbour list is the key data structure MD engines use to avoid O(N2)O(N^2)—for each atom ii, it pre-records “which atoms jj are within cutoff distance”, and per-step force computation only iterates over these. With typical settings (cutoff \approx 1 nm, density close to liquid water), each atom has about 100 neighbours, so the total pair count drops from N2/2N^2/2 to 100N\sim 100 N—for 1M atoms that’s 5×10115×1075 \times 10^{11} \to 5 \times 10^7, four orders of magnitude smaller.

Two main implementations: Verlet list (Verlet 1967) maintains a per-atom neighbour-index array—conceptually simple but naïve construction is O(N2)O(N^2); Cell list divides the simulation box into a 3D grid of cells with edge length \approx cutoff and puts each atom into one cell, so neighbour lookup only checks the current cell plus 26 adjacent ones—construction is O(N)O(N). Modern engines (GROMACS, NAMD) typically use cell-list–accelerated Verlet-list construction.

On PME / Ewald: The Coulomb sum above is only conditionally convergent under periodic boundary conditions—a naive 1/r1/r sum is not only O(N2)O(N^2), it’s also physically wrong (the result depends on summation order). In practice, Ewald summation splits it into a short-range real-space part (pair sum within a cutoff) and a long-range reciprocal-space part (summed in Fourier space), for O(N3/2)O(N^{3/2}) overall. PME (Particle Mesh Ewald) further approximates the reciprocal part using 3D FFT + grid interpolation, bringing complexity down to O(NlogN)O(N \log N)—it’s the default in GROMACS, NAMD, AMBER (mainstream MD engines). So PME isn’t a separate physical interaction—it’s the standard way to evaluate the Coulomb sum.

Under the PME / Ewald view, the Coulomb sum in Eq (3) expands in practice into a real-space + reciprocal-space pair:

UNon-Bonded(r)=  ijqiqjerfc(αrij)4πϵ0rijCoulomb, real-space+12Vϵ0k0ek2/4α2k2S(k)2Coulomb, reciprocal-space (PME)+ij4ϵij[(σijrij)12(σijrij)6]Lennard-Jones(4)\begin{aligned} U_{\text{Non-Bonded}}(\mathbf{r}) = \; & \underbrace{\sum_{ij} \frac{q_i q_j \, \text{erfc}(\alpha r_{ij})}{4\pi\epsilon_0 r_{ij}}}_{\text{Coulomb, real-space}} + \underbrace{\frac{1}{2V\epsilon_0} \sum_{\mathbf{k} \neq 0} \frac{e^{-k^2/4\alpha^2}}{k^2} |S(\mathbf{k})|^2}_{\text{Coulomb, reciprocal-space (PME)}} \\ & + \underbrace{\sum_{ij} 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]}_{\text{Lennard-Jones}} \end{aligned} \tag{4}

where S(k)=iqieikriS(\mathbf{k}) = \sum_i q_i \, e^{i\mathbf{k}\cdot\mathbf{r}_i} is the structure factor (Fourier transform of the charge distribution), and α\alpha is the Ewald parameter controlling how the work is split between real and reciprocal space (larger α\alpha means faster real-space decay but a wider k\mathbf{k} range needed in reciprocal space, and vice versa).

2.3 Bonded: complex form, but linear in count

Bonded interactions describe geometric deformations between atoms connected by covalent bonds—typically four types: bond stretching, angle bending, (proper) dihedral, improper dihedral:

UBonded(r)=Ubond+Uangle+Udihedral+Uimproper(5)U_{\text{Bonded}}(\mathbf{r}) = U_{\text{bond}} + U_{\text{angle}} + U_{\text{dihedral}} + U_{\text{improper}} \tag{5}

Expanded form:

UBonded(r)=  12ijkr(rijr0)2bonds+12ijkkθ(θijkθ0)2angles+ijkln=1Nkϕ,n[1+cos(nϕijklδn)]dihedrals+ijklkω(ωijklω0)2improper dihedrals(6)\begin{aligned} U_{\text{Bonded}}(\mathbf{r}) =\; & \underbrace{\frac{1}{2}\sum_{ij} k_r (r_{ij} - r_0)^2}_{\text{bonds}} + \underbrace{\frac{1}{2}\sum_{ijk} k_\theta (\theta_{ijk} - \theta_0)^2}_{\text{angles}} \\ & + \underbrace{\sum_{ijkl}\sum_{n=1}^{N} k_{\phi,n}\left[1 + \cos(n\phi_{ijkl} - \delta_n)\right]}_{\text{dihedrals}} + \underbrace{\sum_{ijkl} k_\omega (\omega_{ijkl} - \omega_0)^2}_{\text{improper dihedrals}} \end{aligned} \tag{6}

Bonds, angles, and improper dihedrals all use harmonic forms (quadratic penalties around equilibrium values r0r_0, θ0\theta_0, ω0\omega_0); proper dihedrals use a cosine series to capture periodic rotational energy around a bond. Improper dihedrals are typically used to enforce sp² planarity (peptide bonds, aromatic rings, etc.).

Key property of bonded terms: count scales linearly with N. The number of bonds / angles / dihedrals per atom is determined by molecular topology and is constant, so the total count is O(N)O(N) and the overall cost stays modest.

2.4 Compute characteristics per term

Table 1. Functional form, complexity, and compute cost per force-field term. Bonded terms scale linearly in count and are overall cheap; non-bonded terms are MD’s main compute cost, with long-range Coulomb implemented via PME (grid + FFT) whose access pattern differs sharply from short-range pair sums.

TermClassFunctional formComplexityCost
Bond stretchingBondedharmonicO(N)O(N)Low
Angle bendingBondedharmonicO(N)O(N)Low
Dihedral rotationBondedcosine seriesO(N)O(N)Medium
Improper dihedralBondedharmonicO(N)O(N)Low
Lennard-Jones (VdW)Non-bonded12-6, pair sum (cutoff)O(N)O(N) with neighbour list (or O(N2)O(N^2) naïve)High
Coulomb — real-space partNon-bondederfc(αr)/r\text{erfc}(\alpha r)/r, pair sum (cutoff)O(N)O(N) with neighbour listVery high
Coulomb — reciprocal part (PME)Non-bondedcharge spread → 3D FFT → scatter/gatherO(NlogN)O(N \log N)High

§2 summary: The force field’s compute cost concentrates in non-bonded interactions—especially LJ and Coulomb, which sum over all atom pairs. Bonded terms have more complex functional forms but linear count. This imbalance is the root of the kernel-level bottlenecks analysed next.


3. MD’s compute structure: where does the time go?

Now that we know how the force field is computed, the next question is: where does the time in an MD simulation actually go? This decides whether a memory-side accelerator like PIM is worth introducing.

3.1 Why MD is expensive overall

As mentioned above, MD computes atomic motion frame by frame like a movie. Because of this structure, MD’s cost doesn’t come from any single frame being complex—it comes from two other places:

1. Sheer number of repetitions. A typical all-atom MD timestep is 1–2 fs. At 2 fs, 1 ns needs 500,000 steps and 1 µs needs 500,000,000 steps. A million-atom, microsecond simulation has 5×107\sim 5 \times 10^7 short-range interactions per step (assuming ~100 neighbours per atom), cumulatively 2.5×10162.5 \times 10^{16} pair interactions—each step isn’t complex, but it compounds.

2. Strict timestep-to-timestep dependency, limiting parallel scaling. MD is a time-marching (transient) simulation. Force computation inside a timestep can be parallelised heavily, but timesteps have strict dependency: the next step has to wait for the current step’s positions and velocities. This means a single MD trajectory’s strong scaling is bottlenecked by timestep dependency, synchronisation, and communication overhead—throwing more GPUs at it has diminishing returns. An NVIDIA benchmark showed that even with their communication-optimisation library, going beyond ~8 GPU nodes yields rapidly diminishing returns, and can even cause performance drops [4].

3.2 Kernel-level decomposition of a single MD step

§2’s Table 1 lists what mathematical terms make up U(r)U(\mathbf{r}). But when an MD engine actually runs on a GPU, it doesn’t just call one function per Table-1 term. It reorganises those terms into kernels that fit GPU parallelism, plus adds housekeeping operations (force accumulation, integration, I/O) every timestep needs. Table 2 captures this execution view—the mapping from force-field math to GPU kernels.

Table 2. Kernel-level decomposition of a single MD timestep, with mapping to Table 1. Table 1’s 7 force-field terms map to 5 force-computing kernels (Bonded merged; short-range non-bonded sharing the neighbour list; PME split into spread + FFT), plus 3 housekeeping kernels (force accumulation, integration, I/O).

KernelCorresponds to in Table 1RoleDominant bottleneck
1. Bonded forceAll 4 Bonded terms (bond / angle / dihedral / improper)Compute intramolecular geometric deformation energyCheap, regular
2. Short-range non-bonded forceLJ + Coulomb real-spaceCompute pair interactions within cutoffCompute-intensive, memory-sensitive
3. Neighbour list constructionServes short-range non-bonded force (kernel 2)Build the spatial pair listMemory / data-structure heavy
4. Neighbour list traversalServes short-range non-bonded force (kernel 2)Find nearby atom pairsIrregular memory access
5. PME charge spread / interpolationCoulomb reciprocalParticle ↔ grid mappingScatter/gather memory traffic
6. PME FFTCoulomb reciprocalFourier-space summationMemory movement and communication
7. Force accumulation(Across all force terms)Sum force contributions from all kernelsWrite / reduction overhead
8. Integration(Not part of UU; corresponds to Eq 1)Update positions and velocities using FFStreaming memory access
9. Trajectory output(I/O, unrelated to UU)Save frames to diskStorage / I/O bound

The key observation is that MD’s kernels are heterogeneous: no single class of accelerator (including PIM) can be effective on all kernels—at best it can optimise some. This is the physical basis for the Amdahl analysis that follows.

To make the 9 kernels of Table 2 more concrete, the interactive animation below demonstrates each kernel’s access pattern and compute characteristics within a single timestep—the detail panel in the bottom-left shows what’s currently being computed and how many pairs / cells / atoms are involved.

3.3 Memory: capacity isn’t the bottleneck

A common misconception is that MD has heavy memory usage—in fact, memory capacity is not the bottleneck. Take the Max Planck GROMACS benchmark suite as a scale reference [5]: even for a large simulation (12M atoms), the input file is only ~300 MB, with runtime memory in the several-GB range. This fits comfortably in modern GPU memory—a consumer RTX 4090 has 24 GB; a data-centre A100 has 80 GB. So capacity has never been the bottleneck.

But modern GPU MD engines have already brought data movement overhead very low through GPU-resident execution (keeping the entire timestep loop on the GPU to avoid every-step CPU↔GPU transfer), kernel optimisation, and communication / compute overlap [2]. In other words, even with heavy memory traffic, modern GPU HBM bandwidth plus optimised kernels absorb it—memory access isn’t necessarily the global bottleneck.

§3 summary: MD’s time goes almost entirely into a small set of heterogeneous kernels within the timestep loop. Non-bonded computation is the largest slice but already heavily optimised (compute-bound). The remaining memory-bound work is only a fraction. The next section answers: how much real optimisation room PIM has on MD.


4. Does PIM fit MD?

4.1 What PIM likes

PIM is suited to workloads with low arithmetic intensity, heavy memory traffic, poor cache locality, and a significant runtime share.

From Table 2, MD repeatedly reads and writes coordinates, forces, neighbour lists, and PME grids in every timestep. These data are accessed in a loop across millions of timesteps, with high access frequency and irregular locality—well-matched to the patterns PIM likes.

The catch is: when proposing PIM to accelerate MD, what’s the comparison baseline? Modern MD almost exclusively runs on GPUs—GROMACS, NAMD, AMBER, OpenMM, Desmond have all heavily optimised their dominant kernels on GPUs. GROMACS’s docs note that non-bonded pair kernels take 60–90% of runtime [1], and they are already heavily-optimised compute-bound kernels.

So the right question isn’t “can PIM beat CPU at MD?”, but:

Can PIM beat an already heavily-optimised single-GPU baseline at MD?

From §3 we’ve seen: MD’s most expensive part (non-bonded computation) is already compute-bound, not memory-bound—PIM has no structural advantage there.

§4 summary: MD on the surface has PIM-friendly access patterns, but the dominant bottleneck on single GPU is compute, not data movement. PIM’s potential gain is restricted to the remaining memory-bound fraction. The next section quantifies this upper bound with Amdahl.


5. Quantitative analysis: the Amdahl ceiling

5.1 Single-GPU MD runtime breakdown

Aggregating Table 2’s kernels by runtime share:

Table 3. Single-GPU MD runtime breakdown and PIM-fit assessment. Memory-bound portions (PME’s access part + neighbour list) together take ~10–25% of runtime—the potential PIM optimisation space inside the timestep loop. The rest is either compute-bound or off the critical path.

Component% of runtimePIM fitNotes
Optimised non-bonded force computation60–90%Not suitableGPU-optimised: millions of pairs computed in parallel; a textbook compute-bound case
PME10–20%PartialFFT, grid mapping, communication
Neighbour list build / traversal5–15%GoodIrregular memory access, not run every step
Integration / constraints5–10%LimitedCan also be offloaded in GPU-resident mode
I/O and other1–10%SometimesNot on the MD simulation’s critical path

Actual shares still depend on MD engine, force field, PME settings, GPU model, system size, and simulation parameters. For rigorous conclusions on a specific system, you’d want a Nsight / GROMACS timing breakdown profile [3].

The clearly PIM-suitable portion is roughly 10–25% of total runtime—primarily PME’s scatter/gather and neighbour list’s irregular access.

5.2 Amdahl’s Law: PIM’s ceiling

Apply Amdahl’s Law (Eq 7) to quantify what this 10–25% can deliver:

Stotal=1(1f)+fSacc(7)S_{total}=\frac{1}{(1-f)+\frac{f}{S_{acc}}} \tag{7}

where StotalS_{total} is overall speedup, ff is the fraction of runtime that’s accelerable, and SaccS_{acc} is the speedup on that fraction.

Optimistic estimate: take f=0.25f = 0.25 (25% PIM-suitable), assume PIM delivers 10× speedup on the memory-bound part, and ignore overhead:

Stotal=10.75+0.25/10=1/0.7751.29S_{total}=\frac{1}{0.75+0.25/10}=1/0.775\approx 1.29

Overall speedup is just 1.29×.

Theoretical ceiling: even if that 25% becomes infinitely fast:

Smax=110.25=1.33S_{max}=\frac{1}{1-0.25}=1.33

Overall ceiling is just 1.33×.

5.3 Conclusion

The 1.29× / 1.33× number sends a clear signal: on raw single-GPU MD throughput, PIM almost certainly won’t deliver an order-of-magnitude speedup. It can improve some memory-bound kernels, but it cannot move the needle on the compute-bound non-bonded computation that dominates.


6. Where PIM has room: outside the timestep loop

If there’s no room inside the timestep loop, the opportunity is probably outside it. MD’s full lifecycle isn’t just the timestep loop—it also includes data-intensive stages before and after, and that’s where the real PIM × MD opportunities lie.

6.1 Trajectory analysis: a data-centric workload

After a simulation completes, it produces a trajectory file—this is where PIM or in-storage processing (ISP) can really shine.

  • Large data volume: long MD runs don’t save every timestep, but they output trajectory frames at fixed intervals. For million-atom systems on long timescales, trajectory files routinely accumulate to hundreds of GB, sometimes TB.
  • Scan-heavy access pattern: downstream analyses—RMSD/RMSF (see glossary), contact map (matrix of pairwise atomic contacts), hydrogen-bond statistics, distance queries, binding-pose stability filtering (filter out poses that fall apart in short MD)—all need to scan the entire trajectory repeatedly.
  • Simple per-frame computation: per-frame computations (distances, angles, pair statistics) have very low arithmetic intensity.

This “huge data × simple computation × repeated scans” combination is far friendlier to PIM or ISP than the raw MD timestep loop.

§6 summary: PIM × MD research should shift from “accelerating the timestep loop” to “accelerating the data-intensive workflows around MD”.


7. Limitations and scope

The conclusions above rely strictly on a few assumptions—outside these scenarios they may not hold:

  • Single-GPU all-atom MD: this article assumes standard all-atom force fields (AMBER / CHARMM / OPLS family) + PME, running on a single GPU. Coarse-grained models (combining several atoms into a single bead to drastically reduce degrees of freedom, e.g. MARTINI) have a completely different compute structure, and PME may not even apply.
  • Multi-GPU / cluster strong scaling: on multi-GPU, synchronisation overhead and domain decomposition (dividing the simulation box into sub-regions assigned to different GPUs) introduce new bottlenecks.
  • Specialised hardware excluded: D. E. Shaw’s Anton series (a purpose-built ASIC + low-latency interconnect designed specifically for MD, pushing the single-chip timestep loop to its limit) and Google’s TPU-MD take a different path (SIMD).
  • Ab initio MD / QM/MM (AIMD) excluded: this article only covers classical MD (see glossary). AIMD and QM/MM have bottlenecks in SCF iteration (the self-consistent field iteration that solves for the electronic structure, requiring repeated diagonalisation of large matrices) / electron-density evaluation—a completely different problem.
  • Runtime shares are estimates: §5.1’s percentages come from general experience (GROMACS docs + several papers). For a specific target system, profile the actual numbers; don’t apply these directly.

References

[1] gmx nonbonded-benchmark - GROMACS documentation

[2] Performance improvements - GROMACS 2026.2 documentation

[3] Getting good performance from mdrun - GROMACS documentation

[4] Massively Improved Multi-node NVIDIA GPU Scalability with GROMACS

[5] A free GROMACS benchmark set — Max Planck Institute for Multidisciplinary Sciences