Checking and Preparing PDB Files

Overview

Teaching: 20 min
Exercises: 10 min

Questions

What problems are commonly found in PDB files?

Why is fixing errors in PDB files essential for a simulation?

Objectives

Understand why it is necessary to check PDB files before simulation.

Learn how to look for problems in PDB files.

Learn how to fix the common errors in PDB files.

What data is needed to set up a simulation?

Molecular simulation systems are typically prepared from PDB files.
To set up a simulation, only the coordinate section consisting of ATOM, HETATM, and TER records is required.

         atomName  chain            coordinates                temperatureFactor (beta)
             |       |            x       y       z             |
ATOM      1  N   MET A   1      39.754  15.227  24.484  1.00 46.61
                 |       |                               |
           residueName  residueID                       occupancy
 
ATOM    147  CG AASP A  20      53.919   7.536  24.768  0.50 31.95          
ATOM    148  CG BASP A  20      55.391   5.808  23.334  0.50 32.16    
                |                                     
               conformation                         

TER  - indicates the end of a chain

HETATM  832  O   HOH A 106      32.125   6.262  24.443  1.00 21.18    

The lines beginning with “ATOM” represent the atomic coordinates for standard amino acids and nucleotides.
For chemical compounds other than proteins or nucleic acids, the “HETATM” record type is used.
Records of both types use a simple fixed-column format explained here.
A “TER” record marks the end of a chain.
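
Because the format is column-based, individual fields can be read from a record by character position. The sketch below is a minimal illustration using the standard cut and tr utilities on the ATOM record shown above. The helper name pdb_field is made up for this example, and the column ranges are the ones defined by the PDB format for coordinate records.

```shell
#!/usr/bin/env bash
# ATOM record copied from the example above.
line='ATOM      1  N   MET A   1      39.754  15.227  24.484  1.00 46.61'

# pdb_field is a hypothetical helper (not part of any package):
# print the characters in the given column range with spaces removed.
pdb_field() {
    printf '%s\n' "$1" | cut -c "$2" | tr -d ' '
}

record=$(pdb_field "$line" 1-6)        # record type
atom_name=$(pdb_field "$line" 13-16)   # atom name
res_name=$(pdb_field "$line" 18-20)    # residue name
chain=$(pdb_field "$line" 22)          # chain identifier
res_id=$(pdb_field "$line" 23-26)      # residue number
x=$(pdb_field "$line" 31-38)           # coordinates
y=$(pdb_field "$line" 39-46)
z=$(pdb_field "$line" 47-54)
occupancy=$(pdb_field "$line" 55-60)   # occupancy
bfactor=$(pdb_field "$line" 61-66)     # temperature factor

echo "$record $atom_name $res_name $chain $res_id"
echo "$x $y $z $occupancy $bfactor"
```

Reading by column position is safer than splitting on whitespace, because neighbouring fields may touch each other, as in “AASP” where the conformation label is directly followed by the residue name.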

Important Things to Check in a PDB File

A correct simulation of molecules requires error-free input PDB files.

There are several common problems with PDB files, including:

presence of non-protein molecules (crystallographic waters, ligands, modified amino acids, etc.)
alternate conformations
missing side-chain atoms
missing fragments
clashes between atoms
multiple copies of the same protein chains
disulfide bonds
wrong assignment of the N and O atoms in the amide groups of ASN and GLN, and the N and C atoms in the imidazole ring of HIS
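
Some of the items above can be spotted with standard shell tools before running a dedicated checker. The following is a rough sketch, not a substitute for a full check. It works on a small made-up file (only the first ATOM record and the HETATM record are taken from the example above; the other records are invented for illustration) and reports the chain identifiers, the residue names found in HETATM records, and jumps in residue numbering within a chain. A jump in numbering only hints at a missing fragment, because some structures are legitimately numbered with gaps.

```shell
#!/usr/bin/env bash
# Toy input file; records 2-4 are invented for this illustration.
pdb=$(mktemp)
cat > "$pdb" <<'EOF'
ATOM      1  N   MET A   1      39.754  15.227  24.484  1.00 46.61
ATOM      2  N   ALA A   2      41.000  16.000  25.000  1.00 40.00
ATOM      3  N   GLY A   5      45.000  18.000  27.000  1.00 40.00
TER
ATOM      4  N   MET B   1      10.000  20.000  30.000  1.00 40.00
TER
HETATM  832  O   HOH A 106      32.125   6.262  24.443  1.00 21.18
EOF

# Chain identifiers are stored in column 22 of ATOM records.
chains=$(grep '^ATOM' "$pdb" | cut -c 22 | sort -u | tr -d '\n')
echo "Chains: $chains"

# Residue names (columns 18-20) of non-protein HETATM records.
hetero=$(grep '^HETATM' "$pdb" | cut -c 18-20 | sort -u)
echo "HETATM residue names: $hetero"

# Report jumps in residue numbering (columns 23-26) within each chain.
gaps=$(awk '/^ATOM/ {
    chain = substr($0, 22, 1)
    res = substr($0, 23, 4) + 0
    if ((chain in last) && res > last[chain] + 1)
        printf "%s:%d-%d\n", chain, last[chain], res
    last[chain] = res
}' "$pdb")
echo "Numbering gaps: $gaps"

rm -f "$pdb"
```

More than one chain identifier may indicate multiple copies of the same protein, which is worth confirming by comparing the chains before deleting any of them.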

Connect to the training cluster

4 CPUs
4 hours
default RAM
1 GPU

Workshop data:

On the training cluster, copy the archive to your home directory and unpack it:

cd
cp /project/def-sponsor00/workshop_amber_2024.tar.gz .
tar xf workshop_amber_2024.tar.gz

Download link

Checking a molecular structure

Check_structure is a command-line utility from the BioBB project for exhaustive structure quality checking.

Installing check_structure.

module load StdEnv/2023 python scipy-stack
virtualenv ~/env-biobb
source ~/env-biobb/bin/activate
pip install biobb-structure-checking

Using check_structure.

cd ~/workshop_amber/example_01
check_structure commands # print help on commands
check_structure -i 2qwo.pdb checkall 

...
Detected no residues with alternative location labels
...
Found 154 Residues requiring selection on adding H atoms
...
Detected 348 Water molecules
...
Detected 8 Ligands
...
Detected 1 Possible SS Bonds
...
No severe clashes detected

Removing Non-Protein Molecules

Let’s remove ligands and save the output in a new file called “protein.pdb”.

check_structure -i 2qwo.pdb -o protein.pdb ligands --remove all  

Selecting protein atoms using VMD

Select only protein atoms from the file 2qwo.pdb and save them in the new file protein.pdb using VMD.
Solution

Load the VMD module and start the program:
module load StdEnv/2023 vmd
vmd
mol new 2qwo.pdb
set prot [atomselect top "protein"]
$prot writepdb protein.pdb
quit
The first line of code loads a new molecule from 2qwo.pdb. Using the atomselect method, we then select all protein atoms from the top molecule. Finally, we save the selection in the file “protein.pdb”.
The Atom Selection Language has many capabilities. You can learn more about it by visiting the following webpage.

Selecting protein atoms using shell commands

The standard Linux text-searching utility grep can find and print all “ATOM” records in a PDB file. Grep searches files for a given pattern and prints each line that matches it. This exercise is a good introduction to the Unix command line, and grep is useful for many other tasks, such as finding important messages in log files.

Check if a PDB file has “HETATM” records using the grep command.

Select only protein atoms (the “ATOM” and “TER” records) from the file 2qwo.pdb using the grep command and save them in the new file protein.pdb.

Hint: the OR operator in grep is \|. The output from a command can be redirected into a file using the output redirection operator >.
Solution

1.
grep "^HETATM " 1ERT.pdb | wc -l
     46
The ^ expression matches the beginning of a line. We used the grep command to find all lines beginning with the word “HETATM” and then piped these lines to wc -l, which counts lines. The output tells us that the downloaded PDB file contains 46 non-protein atoms. In this case, they are just oxygen atoms of the crystal water molecules.

2.
grep "^ATOM\|^TER " 1ERT.pdb > protein.pdb

Checking PDB Files for alternate conformations.

Check conformations

cd ~/workshop_amber/example_02
check_structure -i 1ert.pdb checkall

ASP A20
  CG   A (0.50) B (0.50)
  OD1  A (0.50) B (0.50)
  OD2  A (0.50) B (0.50)
HIS A43
  CG   A (0.50) B (0.50)
  ND1  A (0.50) B (0.50)
  CD2  A (0.50) B (0.50)
  CE1  A (0.50) B (0.50)
  NE2  A (0.50) B (0.50)
SER A90
  OG   A (0.50) B (0.50)

Select conformer A for ASP20 and conformer B for HIS43 and SER90.

check_structure -i 1ert.pdb -o output.pdb altloc --select A20:A,A43:B,A90:B 
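
Alternate location labels are stored in column 17 of ATOM and HETATM records, so they can also be inspected with awk. The sketch below uses the two ASP 20 records shown earlier plus one ordinary record. It first lists the residues that carry a label and then keeps a single conformer. As a simplifying assumption it keeps the same label (A) for every residue, unlike the per-residue choice made with check_structure above, and it leaves the occupancy values unchanged.

```shell
#!/usr/bin/env bash
# Three records taken from the examples earlier in this episode.
pdb=$(mktemp)
cat > "$pdb" <<'EOF'
ATOM      1  N   MET A   1      39.754  15.227  24.484  1.00 46.61
ATOM    147  CG AASP A  20      53.919   7.536  24.768  0.50 31.95
ATOM    148  CG BASP A  20      55.391   5.808  23.334  0.50 32.16
EOF

# List residue name, chain, residue number and label of labelled atoms.
labelled=$(awk '/^(ATOM|HETATM)/ && substr($0, 17, 1) != " " {
    print substr($0, 18, 3), substr($0, 22, 1), substr($0, 23, 4) + 0, substr($0, 17, 1)
}' "$pdb" | sort -u)
echo "$labelled"

# Keep unlabelled atoms and atoms carrying the label given in "keep",
# blanking column 17 in the output. All other records pass through.
selected=$(awk -v keep=A '/^(ATOM|HETATM)/ {
    alt = substr($0, 17, 1)
    if (alt == " " || alt == keep)
        print substr($0, 1, 16) " " substr($0, 18)
    next
}
{ print }' "$pdb")
echo "$selected"

rm -f "$pdb"
```

With this input, the second command drops the record labelled B and writes the CG atom of ASP 20 without a label.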

Selecting Alternate Conformations with VMD

Check if the file 1ERT.pdb has any alternate conformations.

Select conformation A for residues 43, 90. Select conformation B for residue 20. Save the selection in the file “protein_20B_43A_90A.pdb”.
Solution

1.
mol new 1ert.pdb
set s [atomselect top "altloc A"]
$s get resid
set s [atomselect top "altloc B"]
$s get resid
$s get {resid resname name} 
set s [atomselect top "altloc C"]
$s get resid
quit
The output of the commands tells us that residues 20, 43 and 90 have alternate conformations A and B.

2.
mol new 1ert.pdb
set s [atomselect top "(protein and altloc '') or (altloc B and resid 20) or (altloc A and resid 43 90)"]
$s writepdb protein_20B_43A_90A.pdb
quit

Checking PDB Files for cross-linked cysteines.

To prepare a simulation with AMBER, cross-linked cysteines must be renamed from “CYS” to “CYX”.

cd ~/workshop_amber/example_01
check_structure -i 2qwo.pdb -o output.pdb getss --mark all
grep CYX output.pdb

The GROMACS pdb2gmx utility can automatically add S-S bonds to the topology based on the distance between sulfur atoms. With the -ss option, it asks you interactively which bonds to create.
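
The distance criterion can be illustrated with a few lines of awk. This is a sketch under stated assumptions and not necessarily the criterion that check_structure or pdb2gmx apply: the three SG records are invented for the example, and the 2.5 Å cutoff is an illustrative value chosen to sit a little above the typical S-S bond length of about 2.05 Å.

```shell
#!/usr/bin/env bash
# Invented SG atoms: residues 10 and 45 are 2.04 angstroms apart,
# residue 80 is far from both.
pdb=$(mktemp)
cat > "$pdb" <<'EOF'
ATOM     50  SG  CYS A  10      10.000  10.000  10.000  1.00 20.00
ATOM    120  SG  CYS A  45      12.040  10.000  10.000  1.00 20.00
ATOM    200  SG  CYS A  80      30.000  30.000  30.000  1.00 20.00
EOF

# Collect the SG atoms of CYS residues, then print every pair that is
# closer than the cutoff together with the distance.
pairs=$(awk -v cutoff=2.5 '
/^ATOM/ && substr($0, 18, 3) == "CYS" && substr($0, 13, 4) == " SG " {
    n++
    id[n] = substr($0, 22, 1) (substr($0, 23, 4) + 0)
    x[n] = substr($0, 31, 8) + 0
    y[n] = substr($0, 39, 8) + 0
    z[n] = substr($0, 47, 8) + 0
}
END {
    for (i = 1; i < n; i++)
        for (j = i + 1; j <= n; j++) {
            dx = x[i] - x[j]; dy = y[i] - y[j]; dz = z[i] - z[j]
            d = sqrt(dx * dx + dy * dy + dz * dz)
            if (d < cutoff)
                printf "%s %s %.2f\n", id[i], id[j], d
        }
}' "$pdb")
echo "$pairs"

rm -f "$pdb"
```

Each reported pair is only a candidate for a disulfide bond and should be confirmed by inspecting the structure before renaming the residues.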

Useful Links

The MDWeb server can help identify problems in PDB files and inspect them visually. It can also perform a complete simulation setup, but the options are limited and the waiting time in the queue may be quite long.

CHARMM-GUI can be used to generate input files for simulation with CHARMM force fields. CHARMM-GUI offers useful features, for example the “Membrane Builder” and the “Multicomponent Assembler”.

Key Points

Small errors in the input structure may cause MD simulations to become unstable or give unrealistic results.

