The Basics of PYTRAJ

Overview

Teaching: 25 min
Exercises: 0 min
Questions
  • How to load MD-trajectories into PYTRAJ?

  • How to calculate RMSD to a reference?

Objectives
  • Learn how to calculate and plot RMSD

  • Learn how to run analysis in parallel using MPI

Computing RMSD

We are now ready to use pytraj in a Jupyter notebook.

View Notebook

import pytraj as pt
import numpy as np
from matplotlib import pyplot as plt

%cd ~/workshop_pytraj/example_02

Load the topology and the trajectory:

traj=pt.iterload('mdcrd_nowat.xtc', top='prmtop_nowat.parm7') 

You can also select slices from each of the trajectories, for example load only frames from 100 to 110 from mdcrd_nowat.xtc:

traj_slice=pt.iterload('mdcrd_nowat.xtc', top='prmtop_nowat.parm7', frame_slice=[(100, 110)]) 

View ptraj.iterload manual

To compute RMSD we need a reference structure. We will use the initial pdb file to see how different is our simulation from the experimental structure. Load the reference frame:

ref_coor = pt.load('inpcrd_nowat.pdb')

Before computing RMSD automatically center and image molecules/residues/atoms that are outside of the box back into the box.

traj=traj.autoimage()

Generate time axis for RMSD plot. The trajectory was saved every 0.001 ns, and we have 3140 frames.

time=np.linspace(0, 3.139, 3140)

Compute and plot RMSD of the protein backbone atoms. We can compute and plot RMSD using the initial pdb file and a frame from the trajectory as reference structure.

rmsd_ref = pt.rmsd(traj, ref=ref_coor, nofit=False, mask='@C,N,O')
rmsd_first= pt.rmsd(traj, ref=traj[0], nofit=False, mask='@C,N,O')
plt.plot(time,rmsd_ref)
plt.plot(time,rmsd_first)
plt.xlabel("Time, ns")
plt.ylabel("RMSD, $ \AA $")
plt.show()

Slicing trajectories

Other ways to slice a trajectory:

print(traj[-1])     # The last frame
print(traj[0:8])    # Frames 0 to 7
print(traj[0:8:2])  # Frames 0 to 7 with stride 2
print(traj[::2])    # All frames with stride 2
print(traj(0,20,4)) # Creates frame iterator with (start, stop, step)

Exercise

  1. Compute and plot RMSD of all nucleic acid atoms (residues U,A,G,C) excluding hydrogens for all frames.
  2. Compute and plot RMSD of all protein atoms (residues 1-859) excluding hydrogens for frames 1000-1999.
  3. Repeat using coordinates from frame 500 as a reference.
    View Atom selection syntax

Solution

time1=np.linspace(0,3.139,3140)
rmsd1 = pt.rmsd(traj, ref=ref_coor, nofit=False, mask='(:U,A,G,C)&!(@H=)')
time2=np.linspace(1,1.999,1000)
rmsd2 = pt.rmsd(traj(1000,2000), ref=ref_coor, nofit=False, mask='(:1-859)&!(@H=)')
plt.plot(time1,rmsd1)
plt.plot(time2,rmsd2)
plt.xlabel("Time, ns")
plt.ylabel("RMSD, $ \AA $")

Distributed Parallel RMSD Calculation

View Notebook

import pytraj as pt
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pickle

cd ~/workshop_pytraj/example_02

To process trajectory in parallel we need to create a Python script file rmsd.py instead of entering commands in the notebook. This script when executed with srun will use all available MPI tasks. Each task will process its share of frames and send the result to the master task. The master task will “pickle” the computed rmsd data and save it as a Python object.

To create the Python script directly from the notebook we will use Jupyter magic command %%file.

%%file rmsd.py

import pytraj as pt
import pickle
from mpi4py import MPI

# initialize MPI 
comm = MPI.COMM_WORLD

# get the rank of the process
rank = comm.rank

# load the trajectory file
traj=pt.iterload('mdcrd_nowat.xtc', top='prmtop_nowat.parm7') 
ref_coor = pt.load('inpcrd_nowat.pdb')

# call pmap_mpi function for MPI.
# we dont need to specify the number of CPUs, 
# because we will use srun to run the script
data = pt.pmap_mpi(pt.rmsd, traj, mask='@C,N,O', ref=ref_coor)

# pmap_mpi sends data to rank 0
# rank 0 saves data 
if rank == 0:
    with open("rmsd.dat", "wb") as fp: 
         pickle.dump(data, fp)

Run the script on the cluster. We will take advantage of the resources we have already allocated when we launched Jupyter server and simply use srun without requesting anything:

! srun python rmsd.py

In practice you will be submitting large analysis jobs to the queue with the sbatch command from a normal submission script requesting the desired number of MPI tasks (ntasks).

When the job is done we import the results saved in the file rmsd.dat into Python:

with open("rmsd.dat", "rb") as fp: 
    data=pickle.load(fp)
rmsd=data.get('RMSD_00001')

Set *seaborn* plot theme parameters and plot the data:

sns.set_theme()
sns.set_style("darkgrid")
time=np.linspace(0,3.139,3140)
plt.plot(time,rmsd)
plt.xlabel("Time, ns")
plt.ylabel("RMSD, $ \AA $")

Key Points

  • By keeping everything in one easy accessible place Jupyter notebooks greatly simplify the management and sharing of your work