Can PIM Accelerate Molecular Dynamics?
TL;DR
The question this article answers: Can PIM deliver a meaningful speedup to the main timestep loop of single-GPU all-atom MD?
Short answer: No. Modern single-GPU MD engines (GROMACS / NAMD / AMBER) have already squeezed the dominant kernels hard. On a single GPU, the genuinely memory-bound work that PIM could help with is only about 10–25% of total runtime. Plug that into Amdahl: even with a 10× speedup on that slice, the overall gain is only about 1.1–1.3×—not an order of magnitude.
But step outside the timestep loop: trajectory analysis, neighbour-list preprocessing, ensemble workflows—these data-movement-heavy stages are much friendlier to PIM / in-storage processing. That’s where PIM × MD is actually worth researching.
A few terms
MD / force field / simulation
- Docking: Fast prediction of how a small molecule fits into a protein binding pocket. More static and approximate than MD.
- Trajectory: The atomic motion sequence produced by MD—a series of molecular structures over time.
- Neighbour list: A data structure each MD engine maintains, recording for each atom all the “neighbour” atoms within the cutoff distance. Lets short-range force computation drop from naïve pair sum to (each atom only looks at ~100 neighbours). Common implementations: Verlet list, cell list. Access pattern is irregular—this is the physical origin of the “irregular memory access” kernel in Table 2 (§3.2). See §2.2 for details.
- PME (Particle Mesh Ewald): The standard method for computing long-range Coulomb sums under periodic boundary conditions. Not a separate interaction—just how Coulomb is actually evaluated. Involves a grid and 3D FFT.
- Timestep loop: MD’s main loop—each iteration computes forces and updates atom positions.
- Conformation / conformational change: Different 3D shapes a molecule can adopt by rotating single bonds (without breaking any). “Conformational change” specifically refers to how a protein’s shape adjusts when a drug binds—the key phenomenon MD captures but docking cannot (induced fit, cryptic pockets, allosteric effects all belong here).
- AIMD (Ab Initio MD) / QM/MM: MD variants that compute forces using quantum mechanics. AIMD uses QM throughout; QM/MM applies QM to a small region of interest (e.g., a reaction centre) and classical force fields elsewhere. Several orders of magnitude more expensive than classical MD.
Drug discovery pipeline
- Hit discovery / hit-to-lead: Two phases of drug discovery. Hit discovery is the initial screening stage producing active candidates (hundreds to thousands); hit-to-lead is the phase of optimising, validating, and narrowing hits down to a few lead compounds.
- FEP (Free Energy Perturbation) / MM-PBSA: Two MD-based methods for accurately computing binding free energy. FEP uses an alchemical pathway—most accurate but costs several GPU·days per molecule; MM-PBSA uses endpoint conformations with implicit solvent—cheaper but less precise.
- RMSD / RMSF: The two most common metrics in trajectory analysis. RMSD (Root Mean Square Deviation) measures overall deviation from a reference structure; RMSF (Root Mean Square Fluctuation) measures per-atom temporal variability.
Hardware / performance
- PIM (Processing-in-Memory): An architecture placing compute units near or inside memory cells to reduce data movement. Suited to workloads with low arithmetic intensity and heavy memory access.
- Memory-bound / compute-bound: Whether the performance bottleneck is data movement (bandwidth, latency) or arithmetic. PIM mainly helps the former.
- HBM (High Bandwidth Memory): High-bandwidth on-package memory used in modern GPUs, at TB/s scale—the hardware reason modern MD engines can move so much data without choking.
- Strong scaling: How well performance scales when you fix the problem size and add more compute. Strong scaling of a single MD trajectory is bottlenecked by timestep dependency—synchronisation, not raw compute.
1. What does Molecular Dynamics do?
Molecular Dynamics (MD) computes atomic motion frame by frame using Newtonian mechanics. Put simply, MD answers:
Given all atom positions right now, where will they be a small slice of time from now?
In drug design, MD plays the role of the “molecular movie cinematographer”. The upstream Virtual Screening (VS)—using methods like docking, similarity search (by molecular fingerprint similarity), ML scoring / docking, etc.—filters compound libraries from billions down to a few hundred to a few thousand hits. But VS only gives you a static snapshot: “this small molecule probably fits into the binding pocket.”
In reality, proteins breathe, water molecules dart in and out, side chains swing. So after getting hits, MD takes over: simulate atomic motion frame by frame under Newtonian mechanics for tens to hundreds of nanoseconds, checking whether the molecule actually stays put in water—whether key hydrogen bonds break, how the water molecules in the pocket arrange themselves, and whether the protein undergoes conformational change (see glossary) in response to binding. Only if the “movie” shows the molecule doesn’t escape the pocket do you spend on the more expensive computation downstream (FEP / MM-PBSA).
MD’s usefulness extends well beyond drug discovery: protein folding, membrane proteins, enzyme kinetics, materials science, biomolecular mechanism studies all rely on it. These share the same underlying engine and compute structure, so accelerating MD effectively accelerates an entire downstream workflow.
2. What does MD compute?
Classical MD is built on Newtonian mechanics. For each atom :
where is the total potential energy of the system, written as a function of all atomic coordinates . MD’s core loop: compute the force on each atom, update positions and velocities, advance to the next timestep, repeat.
A simplified MD loop:
Initial atomic structure → Assign force-field parameters → Enter the timestep loop:
Build / update neighbour list
Compute bonded forces
Compute short-range non-bonded forces (LJ + short-range Coulomb)
Compute long-range electrostatics (PME)
Apply constraints
Update positions and velocities
Occasionally write trajectory
→ Output trajectory
Almost all the runtime cost of this loop goes into force computation—into the specific form of . This is called the force field, and it’s the heart of MD. We unpack it below.
2.1 Overall structure of the force field
The force field decomposes the total potential energy into two main categories:
- Non-bonded : interactions you feel just by being close in space—electrostatics, van der Waals
- Bonded : geometric deformations between atoms connected by covalent bonds—bond length, bond angle, dihedral
These two categories have very different compute characteristics, which is the foundation of the bottleneck analysis later. We unpack them separately below.
2.2 Non-bonded: the main compute cost in MD
Non-bonded interactions are typically modelled as a combination of electrostatic (Coulomb) and van der Waals (Lennard-Jones) potentials:
- Coulomb term: depends on inter-atomic distance , vacuum permittivity , and partial charges ,
- LJ term: captures long-range attractive dispersion (van der Waals); captures short-range repulsion from overlapping electron clouds
Why is non-bonded expensive? Because pair count grows fast with system size. A million-atom system has pair interactions—and a microsecond MD simulation runs timesteps, each step recomputing forces. So mainstream MD engines don’t actually run pair sum. Instead they use neighbour lists + cutoff to bring the short-range part down to , and PME (Particle Mesh Ewald) to bring the long-range part down to .
On neighbour lists: A neighbour list is the key data structure MD engines use to avoid —for each atom , it pre-records “which atoms are within cutoff distance”, and per-step force computation only iterates over these. With typical settings (cutoff 1 nm, density close to liquid water), each atom has about 100 neighbours, so the total pair count drops from to —for 1M atoms that’s , four orders of magnitude smaller.
Two main implementations: Verlet list (Verlet 1967) maintains a per-atom neighbour-index array—conceptually simple but naïve construction is ; Cell list divides the simulation box into a 3D grid of cells with edge length cutoff and puts each atom into one cell, so neighbour lookup only checks the current cell plus 26 adjacent ones—construction is . Modern engines (GROMACS, NAMD) typically use cell-list–accelerated Verlet-list construction.
On PME / Ewald: The Coulomb sum above is only conditionally convergent under periodic boundary conditions—a naive sum is not only , it’s also physically wrong (the result depends on summation order). In practice, Ewald summation splits it into a short-range real-space part (pair sum within a cutoff) and a long-range reciprocal-space part (summed in Fourier space), for overall. PME (Particle Mesh Ewald) further approximates the reciprocal part using 3D FFT + grid interpolation, bringing complexity down to —it’s the default in GROMACS, NAMD, AMBER (mainstream MD engines). So PME isn’t a separate physical interaction—it’s the standard way to evaluate the Coulomb sum.
Under the PME / Ewald view, the Coulomb sum in Eq (3) expands in practice into a real-space + reciprocal-space pair:
where is the structure factor (Fourier transform of the charge distribution), and is the Ewald parameter controlling how the work is split between real and reciprocal space (larger means faster real-space decay but a wider range needed in reciprocal space, and vice versa).
2.3 Bonded: complex form, but linear in count
Bonded interactions describe geometric deformations between atoms connected by covalent bonds—typically four types: bond stretching, angle bending, (proper) dihedral, improper dihedral:
Expanded form:
Bonds, angles, and improper dihedrals all use harmonic forms (quadratic penalties around equilibrium values , , ); proper dihedrals use a cosine series to capture periodic rotational energy around a bond. Improper dihedrals are typically used to enforce sp² planarity (peptide bonds, aromatic rings, etc.).
Key property of bonded terms: count scales linearly with N. The number of bonds / angles / dihedrals per atom is determined by molecular topology and is constant, so the total count is and the overall cost stays modest.
2.4 Compute characteristics per term
Table 1. Functional form, complexity, and compute cost per force-field term. Bonded terms scale linearly in count and are overall cheap; non-bonded terms are MD’s main compute cost, with long-range Coulomb implemented via PME (grid + FFT) whose access pattern differs sharply from short-range pair sums.
| Term | Class | Functional form | Complexity | Cost |
|---|---|---|---|---|
| Bond stretching | Bonded | harmonic | Low | |
| Angle bending | Bonded | harmonic | Low | |
| Dihedral rotation | Bonded | cosine series | Medium | |
| Improper dihedral | Bonded | harmonic | Low | |
| Lennard-Jones (VdW) | Non-bonded | 12-6, pair sum (cutoff) | with neighbour list (or naïve) | High |
| Coulomb — real-space part | Non-bonded | , pair sum (cutoff) | with neighbour list | Very high |
| Coulomb — reciprocal part (PME) | Non-bonded | charge spread → 3D FFT → scatter/gather | High |
§2 summary: The force field’s compute cost concentrates in non-bonded interactions—especially LJ and Coulomb, which sum over all atom pairs. Bonded terms have more complex functional forms but linear count. This imbalance is the root of the kernel-level bottlenecks analysed next.
3. MD’s compute structure: where does the time go?
Now that we know how the force field is computed, the next question is: where does the time in an MD simulation actually go? This decides whether a memory-side accelerator like PIM is worth introducing.
3.1 Why MD is expensive overall
As mentioned above, MD computes atomic motion frame by frame like a movie. Because of this structure, MD’s cost doesn’t come from any single frame being complex—it comes from two other places:
1. Sheer number of repetitions. A typical all-atom MD timestep is 1–2 fs. At 2 fs, 1 ns needs 500,000 steps and 1 µs needs 500,000,000 steps. A million-atom, microsecond simulation has short-range interactions per step (assuming ~100 neighbours per atom), cumulatively pair interactions—each step isn’t complex, but it compounds.
2. Strict timestep-to-timestep dependency, limiting parallel scaling. MD is a time-marching (transient) simulation. Force computation inside a timestep can be parallelised heavily, but timesteps have strict dependency: the next step has to wait for the current step’s positions and velocities. This means a single MD trajectory’s strong scaling is bottlenecked by timestep dependency, synchronisation, and communication overhead—throwing more GPUs at it has diminishing returns. An NVIDIA benchmark showed that even with their communication-optimisation library, going beyond ~8 GPU nodes yields rapidly diminishing returns, and can even cause performance drops [4].
3.2 Kernel-level decomposition of a single MD step
§2’s Table 1 lists what mathematical terms make up . But when an MD engine actually runs on a GPU, it doesn’t just call one function per Table-1 term. It reorganises those terms into kernels that fit GPU parallelism, plus adds housekeeping operations (force accumulation, integration, I/O) every timestep needs. Table 2 captures this execution view—the mapping from force-field math to GPU kernels.
Table 2. Kernel-level decomposition of a single MD timestep, with mapping to Table 1. Table 1’s 7 force-field terms map to 5 force-computing kernels (Bonded merged; short-range non-bonded sharing the neighbour list; PME split into spread + FFT), plus 3 housekeeping kernels (force accumulation, integration, I/O).
| Kernel | Corresponds to in Table 1 | Role | Dominant bottleneck |
|---|---|---|---|
| 1. Bonded force | All 4 Bonded terms (bond / angle / dihedral / improper) | Compute intramolecular geometric deformation energy | Cheap, regular |
| 2. Short-range non-bonded force | LJ + Coulomb real-space | Compute pair interactions within cutoff | Compute-intensive, memory-sensitive |
| 3. Neighbour list construction | Serves short-range non-bonded force (kernel 2) | Build the spatial pair list | Memory / data-structure heavy |
| 4. Neighbour list traversal | Serves short-range non-bonded force (kernel 2) | Find nearby atom pairs | Irregular memory access |
| 5. PME charge spread / interpolation | Coulomb reciprocal | Particle ↔ grid mapping | Scatter/gather memory traffic |
| 6. PME FFT | Coulomb reciprocal | Fourier-space summation | Memory movement and communication |
| 7. Force accumulation | (Across all force terms) | Sum force contributions from all kernels | Write / reduction overhead |
| 8. Integration | (Not part of ; corresponds to Eq 1) | Update positions and velocities using | Streaming memory access |
| 9. Trajectory output | (I/O, unrelated to ) | Save frames to disk | Storage / I/O bound |
The key observation is that MD’s kernels are heterogeneous: no single class of accelerator (including PIM) can be effective on all kernels—at best it can optimise some. This is the physical basis for the Amdahl analysis that follows.
To make the 9 kernels of Table 2 more concrete, the interactive animation below demonstrates each kernel’s access pattern and compute characteristics within a single timestep—the detail panel in the bottom-left shows what’s currently being computed and how many pairs / cells / atoms are involved.
3.3 Memory: capacity isn’t the bottleneck
A common misconception is that MD has heavy memory usage—in fact, memory capacity is not the bottleneck. Take the Max Planck GROMACS benchmark suite as a scale reference [5]: even for a large simulation (12M atoms), the input file is only ~300 MB, with runtime memory in the several-GB range. This fits comfortably in modern GPU memory—a consumer RTX 4090 has 24 GB; a data-centre A100 has 80 GB. So capacity has never been the bottleneck.
But modern GPU MD engines have already brought data movement overhead very low through GPU-resident execution (keeping the entire timestep loop on the GPU to avoid every-step CPU↔GPU transfer), kernel optimisation, and communication / compute overlap [2]. In other words, even with heavy memory traffic, modern GPU HBM bandwidth plus optimised kernels absorb it—memory access isn’t necessarily the global bottleneck.
§3 summary: MD’s time goes almost entirely into a small set of heterogeneous kernels within the timestep loop. Non-bonded computation is the largest slice but already heavily optimised (compute-bound). The remaining memory-bound work is only a fraction. The next section answers: how much real optimisation room PIM has on MD.
4. Does PIM fit MD?
4.1 What PIM likes
PIM is suited to workloads with low arithmetic intensity, heavy memory traffic, poor cache locality, and a significant runtime share.
From Table 2, MD repeatedly reads and writes coordinates, forces, neighbour lists, and PME grids in every timestep. These data are accessed in a loop across millions of timesteps, with high access frequency and irregular locality—well-matched to the patterns PIM likes.
The catch is: when proposing PIM to accelerate MD, what’s the comparison baseline? Modern MD almost exclusively runs on GPUs—GROMACS, NAMD, AMBER, OpenMM, Desmond have all heavily optimised their dominant kernels on GPUs. GROMACS’s docs note that non-bonded pair kernels take 60–90% of runtime [1], and they are already heavily-optimised compute-bound kernels.
So the right question isn’t “can PIM beat CPU at MD?”, but:
Can PIM beat an already heavily-optimised single-GPU baseline at MD?
From §3 we’ve seen: MD’s most expensive part (non-bonded computation) is already compute-bound, not memory-bound—PIM has no structural advantage there.
§4 summary: MD on the surface has PIM-friendly access patterns, but the dominant bottleneck on single GPU is compute, not data movement. PIM’s potential gain is restricted to the remaining memory-bound fraction. The next section quantifies this upper bound with Amdahl.
5. Quantitative analysis: the Amdahl ceiling
5.1 Single-GPU MD runtime breakdown
Aggregating Table 2’s kernels by runtime share:
Table 3. Single-GPU MD runtime breakdown and PIM-fit assessment. Memory-bound portions (PME’s access part + neighbour list) together take ~10–25% of runtime—the potential PIM optimisation space inside the timestep loop. The rest is either compute-bound or off the critical path.
| Component | % of runtime | PIM fit | Notes |
|---|---|---|---|
| Optimised non-bonded force computation | 60–90% | Not suitable | GPU-optimised: millions of pairs computed in parallel; a textbook compute-bound case |
| PME | 10–20% | Partial | FFT, grid mapping, communication |
| Neighbour list build / traversal | 5–15% | Good | Irregular memory access, not run every step |
| Integration / constraints | 5–10% | Limited | Can also be offloaded in GPU-resident mode |
| I/O and other | 1–10% | Sometimes | Not on the MD simulation’s critical path |
Actual shares still depend on MD engine, force field, PME settings, GPU model, system size, and simulation parameters. For rigorous conclusions on a specific system, you’d want a Nsight / GROMACS timing breakdown profile [3].
The clearly PIM-suitable portion is roughly 10–25% of total runtime—primarily PME’s scatter/gather and neighbour list’s irregular access.
5.2 Amdahl’s Law: PIM’s ceiling
Apply Amdahl’s Law (Eq 7) to quantify what this 10–25% can deliver:
where is overall speedup, is the fraction of runtime that’s accelerable, and is the speedup on that fraction.
Optimistic estimate: take (25% PIM-suitable), assume PIM delivers 10× speedup on the memory-bound part, and ignore overhead:
Overall speedup is just 1.29×.
Theoretical ceiling: even if that 25% becomes infinitely fast:
Overall ceiling is just 1.33×.
5.3 Conclusion
The 1.29× / 1.33× number sends a clear signal: on raw single-GPU MD throughput, PIM almost certainly won’t deliver an order-of-magnitude speedup. It can improve some memory-bound kernels, but it cannot move the needle on the compute-bound non-bonded computation that dominates.
6. Where PIM has room: outside the timestep loop
If there’s no room inside the timestep loop, the opportunity is probably outside it. MD’s full lifecycle isn’t just the timestep loop—it also includes data-intensive stages before and after, and that’s where the real PIM × MD opportunities lie.
6.1 Trajectory analysis: a data-centric workload
After a simulation completes, it produces a trajectory file—this is where PIM or in-storage processing (ISP) can really shine.
- Large data volume: long MD runs don’t save every timestep, but they output trajectory frames at fixed intervals. For million-atom systems on long timescales, trajectory files routinely accumulate to hundreds of GB, sometimes TB.
- Scan-heavy access pattern: downstream analyses—RMSD/RMSF (see glossary), contact map (matrix of pairwise atomic contacts), hydrogen-bond statistics, distance queries, binding-pose stability filtering (filter out poses that fall apart in short MD)—all need to scan the entire trajectory repeatedly.
- Simple per-frame computation: per-frame computations (distances, angles, pair statistics) have very low arithmetic intensity.
This “huge data × simple computation × repeated scans” combination is far friendlier to PIM or ISP than the raw MD timestep loop.
§6 summary: PIM × MD research should shift from “accelerating the timestep loop” to “accelerating the data-intensive workflows around MD”.
7. Limitations and scope
The conclusions above rely strictly on a few assumptions—outside these scenarios they may not hold:
- Single-GPU all-atom MD: this article assumes standard all-atom force fields (AMBER / CHARMM / OPLS family) + PME, running on a single GPU. Coarse-grained models (combining several atoms into a single bead to drastically reduce degrees of freedom, e.g. MARTINI) have a completely different compute structure, and PME may not even apply.
- Multi-GPU / cluster strong scaling: on multi-GPU, synchronisation overhead and domain decomposition (dividing the simulation box into sub-regions assigned to different GPUs) introduce new bottlenecks.
- Specialised hardware excluded: D. E. Shaw’s Anton series (a purpose-built ASIC + low-latency interconnect designed specifically for MD, pushing the single-chip timestep loop to its limit) and Google’s TPU-MD take a different path (SIMD).
- Ab initio MD / QM/MM (AIMD) excluded: this article only covers classical MD (see glossary). AIMD and QM/MM have bottlenecks in SCF iteration (the self-consistent field iteration that solves for the electronic structure, requiring repeated diagonalisation of large matrices) / electron-density evaluation—a completely different problem.
- Runtime shares are estimates: §5.1’s percentages come from general experience (GROMACS docs + several papers). For a specific target system, profile the actual numbers; don’t apply these directly.
References
[1] gmx nonbonded-benchmark - GROMACS documentation
[2] Performance improvements - GROMACS 2026.2 documentation
[3] Getting good performance from mdrun - GROMACS documentation
[4] Massively Improved Multi-node NVIDIA GPU Scalability with GROMACS
[5] A free GROMACS benchmark set — Max Planck Institute for Multidisciplinary Sciences