Hands-on 6: A Comparative Performance Ranking of Molecular Dynamics Software

Overview

Teaching: 30 min
Exercises: 5 min
Questions
  • How do I evaluate the CPU efficiency of a simulation?

  • How fast and efficiently will my simulation run with different programs and computing resources?

Objectives
  • Learn how to request the right computational resources

Requesting the right computational resources is essential for fast and efficient simulations. Submitting a simulation with more CPUs does not necessarily mean that it will run faster; in some cases, a simulation will run slower with more CPUs. There is also a choice between CPU and GPU versions of the programs. When deciding on the number of CPUs, it is crucial to consider both simulation speed and CPU efficiency. If CPU efficiency is low, you will be wasting resources. This will negatively impact your priority, and as a result, you will not be able to run as many jobs as you would if you used CPUs more efficiently. To assess CPU efficiency, you need to know how fast a serial simulation runs and then compare the expected 100% efficient speedup (the speed on 1 CPU multiplied by N) with the actual speedup on N CPUs.
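
For example, if a simulation runs at 10 ns/day on 1 CPU and at 60 ns/day on 8 CPUs (hypothetical numbers), the ideal speedup is 8 and the actual speedup is 60/10 = 6, so the parallel efficiency is 6/8 = 75%. On the Alliance clusters you can also check the CPU and memory efficiency of a completed job with the Slurm utility seff:

seff <jobid>    # replace <jobid> with the ID of a completed job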

Here is the chart of the maximum simulation speed of all MD engines tested on the Alliance systems. These results may give you valuable insight into how fast and efficiently you can expect your simulation to run with different packages and resources.

Submission scripts for running the benchmarks.
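
Each of the scripts below is saved to a file and submitted with sbatch (the file name here is only a placeholder):

sbatch submit_benchmark.sh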

GROMACS

Extend the simulation by 10,000 steps:

gmx convert-tpr -s topol.tpr -nsteps 10000 -o next.tpr

Submission script for a CPU simulation

#!/bin/bash
#SBATCH --mem-per-cpu=4000M
#SBATCH --time=10:00:00
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4

module load StdEnv/2020 gcc/9.3.0 openmpi/4.0.3 gromacs/2023.2
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# gmx_mpi is the MPI-enabled binary; plain gmx launched under srun would
# start independent thread-MPI simulations instead of one parallel run
srun gmx_mpi mdrun -s next.tpr -cpi state.cpt

Benchmark
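
At the end of the run GROMACS prints the achieved speed (in ns/day and hours/ns) at the bottom of the log file. Assuming the default log file name md.log, it can be extracted with:

grep "Performance:" md.log | tail -n 1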

Submission script for a single GPU simulation

#!/bin/bash
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=1:00:00   
#SBATCH --cpus-per-task=12
#SBATCH --gpus-per-node=1  

module load StdEnv/2020 gcc/9.3.0 cuda/11.4 openmpi/4.0.3 gromacs/2023.2

gmx mdrun -ntomp ${SLURM_CPUS_PER_TASK:-1} \
    -nb gpu -pme gpu -update gpu -bonded cpu -s topol.tpr

Submission script for a multiple GPU simulation

#!/bin/bash
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=1:00:00   
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --gpus-per-task=1

module load StdEnv/2020 gcc/9.3.0 cuda/11.4 openmpi/4.0.3 gromacs/2023.2

# gmx_mpi is the MPI-enabled binary required for multi-rank (multi-GPU) runs
srun gmx_mpi mdrun -ntomp ${SLURM_CPUS_PER_TASK:-1} \
    -nb gpu -pme gpu -update gpu -bonded cpu -s topol.tpr

Benchmark

PMEMD

Submission script for a single GPU simulation

#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --gpus 1
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=1:00:00

module --force purge
module load StdEnv/2020  gcc/9.3.0 cuda/11.4 openmpi/4.0.3 amber/20.12-20.15
pmemd.cuda -O -i pmemd.in -o production.log -p prmtop.parm7 -c restart.rst7

Submission script for a multiple GPU simulation

The multiple-GPU build of pmemd (pmemd.cuda.MPI) is meant only for AMBER methods that run several simulations at once, such as replica exchange. A single simulation does not scale beyond one GPU.

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=1:00:00

module --force purge
module load StdEnv/2020  gcc/9.3.0 cuda/11.4 openmpi/4.0.3 amber/20.12-20.15

srun pmemd.cuda.MPI -O -i pmemd_prod.in -o production.log \
        -p prmtop.parm7 -c restart.rst7

Benchmark
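
pmemd reports the average simulation speed in ns/day in the timing summary at the end of the output file. Assuming the output file name production.log used in the scripts above:

grep "ns/day" production.log | tail -n 1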

NAMD 3

Submission script for a GPU simulation

#!/bin/bash
#SBATCH --cpus-per-task=2
#SBATCH --gpus-per-node=a100:2  
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=1:00:00
NAMDHOME=$HOME/NAMD_3.0b3_Linux-x86_64-multicore-CUDA

$NAMDHOME/namd3 +p${SLURM_CPUS_PER_TASK} +idlepoll namd3_input.in  

Benchmark
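
NAMD typically prints its measured speed (days/ns) in "Benchmark time:" lines during the run. Since the script above does not redirect output, these lines end up in the Slurm output file (slurm-<jobid>.out by default):

grep "Benchmark time" slurm-<jobid>.out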

How to make your simulation run faster?

It is possible to increase the time step to 4 fs with hydrogen mass repartitioning. The idea is that the masses of hydrogen atoms are increased while the masses of the heavy atoms they are bonded to are decreased by the same amount, keeping the total mass constant. Hydrogen masses can be repartitioned automatically with the parmed program.

module --force purge
module load StdEnv/2020 gcc ambertools python scipy-stack
source $EBROOTAMBERTOOLS/amber.sh
parmed prmtop.parm7
ParmEd: a Parameter file Editor

Loaded Amber topology file prmtop.parm7
Reading input from STDIN...
> hmassrepartition
> outparm prmtop_hmass.parm7
> quit
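
To benefit from the repartitioning, run pmemd with the new topology prmtop_hmass.parm7 instead of the original one and raise the time step to 4 fs by setting dt = 0.004 in the &cntrl namelist of the input file. A minimal sketch reusing the single-GPU command from above (pmemd_hmass.in and production_hmass.log are placeholder names):

# pmemd_hmass.in must contain dt = 0.004 (time step in ps) in &cntrl
pmemd.cuda -O -i pmemd_hmass.in -o production_hmass.log \
    -p prmtop_hmass.parm7 -c restart.rst7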

References:
1. Lessons learned from comparing molecular dynamics engines on the SAMPL5 dataset
2. Delivering up to 9X the Throughput with NAMD v3 and NVIDIA A100 GPU
3. AMBER GPU Docs
4. Long-Time-Step Molecular Dynamics through Hydrogen Mass Repartitioning

Key Points

  • To assess CPU efficiency, measure how fast a serial simulation runs and compare the ideal speedup (N x the serial speed) with the actual speedup on N CPUs