UPDB
Andrew Dalke
dalke@bioreason.com
Fri, 28 Aug 1998 14:25:37 -0700
Malcolm Gillies <gillies@cmcind.far.ruu.nl> wrote:
> What did you have in mind with regard to support of the various
> mutations of the PDB format which are read and written by various
> modelling programs? Most of the misery I've had to deal with has
> been a result of the quirks of the PDB handling in programs such
> as Insight, Sybyl, DelPhi, GRASP etc.
Can you give me some information about the syntax differences used
by those programs? Not semantics (like how atom names are done)
because I don't attempt to address that at my level of code. I'm
only worried about a column delimited file format with a small set
of fixed data types.
Here's how I can support different formats. The parser generator
takes an input file that describes the format. Each program should
have its own file of definitions. For example, here are the
old-style and the XPLOR-extended old-style ATOM records (sans entry
and line number information):
=========================
This is the original (well, at least 1992) PDB definition for an ATOM
record. It has neither the element name nor the charge. The insertion
code can be any character, unlike the AChar it is now.
COLUMNS   DATA TYPE      FIELD        DEFINITION
------------------------------------------------------------
 1 -  6   Record name    "ATOM  " (obsolete)
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        Character      iCode        Insertion code.
31 - 38   Real(8.3)      x            Orthogonal coordinates
                                      for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates
                                      for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates
                                      for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
This supports the "segment" extension of XPLOR, from before the PDB
itself supported it. Compared with the current PDB definition, the
differences are the Character insertion code and the missing charge
field.
COLUMNS   DATA TYPE      FIELD        DEFINITION
------------------------------------------------------------
 1 -  6   Record name    "ATOM  " (obsolete) with segID
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        Character      iCode        Insertion code.
31 - 38   Real(8.3)      x            Orthogonal coordinates
                                      for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates
                                      for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates
                                      for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
73 - 76   LString(4)     segID        Segment identifier,
                                      left-justified.
======================================
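To give a feel for what a parser built from such a definition boils
down to, here is a minimal sketch. It is not the generator's actual
output; the names FIELDS_OLD, FIELDS_XPLOR and unpack_atom are made up
for this example, and the field tables are simply transcribed from the
two definitions above. It also assumes every numeric field is filled
in, since a blank one would make int() or float() raise ValueError.

```python
# Hypothetical sketch: each entry is (first column, last column,
# converter, field name), with 1-based inclusive columns as in the
# definition tables.
FIELDS_OLD = [
    (7, 11, int, "serial"),
    (13, 16, str, "name"),
    (17, 17, str, "altLoc"),
    (18, 20, str, "resName"),
    (22, 22, str, "chainID"),
    (23, 26, int, "resSeq"),
    (27, 27, str, "iCode"),
    (31, 38, float, "x"),
    (39, 46, float, "y"),
    (47, 54, float, "z"),
    (55, 60, float, "occupancy"),
    (61, 66, float, "tempFactor"),
]

# The XPLOR variant only adds the segment identifier.
FIELDS_XPLOR = FIELDS_OLD + [(73, 76, str, "segID")]


def unpack_atom(line, fields):
    """Slice a fixed-column line into a dictionary of typed fields."""
    line = line.ljust(80)  # pad so short lines still slice cleanly
    record = {}
    for first, last, convert, name in fields:
        text = line[first - 1:last]  # 1-based inclusive -> Python slice
        # Character fields are kept exactly as found, spaces included.
        record[name] = text if convert is str else convert(text)
    return record


sample = ("ATOM      1  CA  ALA A   1      11.104   6.134  -6.504"
          "  1.00  0.00      SEG1")
old_record = unpack_atom(sample, FIELDS_OLD)
xplor_record = unpack_atom(sample, FIELDS_XPLOR)
```

The same line unpacks under either table; only the XPLOR table
produces a segID entry.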
The definitions create different parser classes, and I have a master
class that figures out which function to call. For example, here's
code for determining which version of the HELIX definition
to use. The old definition does not have the helix length in
columns 72-76. Since the formats should be consistent throughout
the file, I only have to check the type once:
    def __init__(self):
        self.version2_1 = Version2_1.Version2_1()
        self.old = OldVersion.OldVersion()
        # these are ones where we need to figure out which version to use
        self.unpack_map['HELIX '] = self.resolve_HELIX
        self.unpack_map['HELIX'] = self.resolve_HELIX
        ...
        # This has been dropped in the newer version
        self.unpack_map['FTNOTE'] = self.old.unpack_FTNOTE

    def resolve_HELIX(self, line):
        s = line[71:76]
        try:
            string.atoi(s)
        except ValueError:
            # old-style
            self.unpack_map['HELIX '] = self.old.unpack_HELIX
            self.unpack_map['HELIX'] = self.old.unpack_HELIX
            return self.unpack_map['HELIX'](line)
        # new version
        self.unpack_map['HELIX '] = self.version2_1.unpack_HELIX
        self.unpack_map['HELIX'] = self.version2_1.unpack_HELIX
        return self.unpack_map['HELIX'](line)

    def unpack(self, line):
        typ = line[:6]
        parser = self.unpack_map[typ]
        # have something, so work with it
        x = parser(line)
        # store the type information in the dictionary
        # Should the individual parser add the type info instead?
        # Probably. This feels too hackish.
        x['type'] = typ
        return x
So the first time you unpack a 'COMPND', the unpack_map is used to
call the method 'resolve_COMPND'. That checks the fields to
determine which record version it is, replaces the unpack_map value
with the right one, and forwards the call. (I really like Python!)
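Stripped down to a self-contained toy, the replace-and-forward trick
looks roughly like this. The class name Resolver, the stub unpackers
and the resolve_calls counter are made up for illustration and are not
the real generated code; the sketch also uses int() for the integer
check.

```python
class Resolver:
    """Toy master class: the map entry rewrites itself on first use."""

    def __init__(self):
        # Initially the map points at the resolver, not at a parser.
        self.unpack_map = {"HELIX ": self.resolve_HELIX}
        self.resolve_calls = 0  # only here to show the check runs once

    def unpack_HELIX_old(self, line):
        # Stub standing in for the old-style parser.
        return {"version": "old"}

    def unpack_HELIX_new(self, line):
        # Stub standing in for the newer parser, which has a length.
        return {"version": "new", "length": int(line[71:76])}

    def resolve_HELIX(self, line):
        self.resolve_calls += 1
        try:
            int(line[71:76])  # is there an integer helix length?
        except ValueError:
            chosen = self.unpack_HELIX_old
        else:
            chosen = self.unpack_HELIX_new
        # Replace the map entry, then forward this first call.
        self.unpack_map["HELIX "] = chosen
        return chosen(line)

    def unpack(self, line):
        record = self.unpack_map[line[:6]](line)
        record["type"] = line[:6]
        return record


# A made-up HELIX line with only the length field filled in.
helix_line = "HELIX ".ljust(71) + "   12"
resolver = Resolver()
first = resolver.unpack(helix_line)
second = resolver.unpack(helix_line)
```

After the first call the map entry is the chosen parser itself, so
the second call never goes through resolve_HELIX.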
All the logic is hand written, as I could figure out no good way to
automate that (e.g., given all of the record definitions, create a
master parser to choose between them). That's really hard, and
I don't think there are enough cases to warrant the effort.
> number of significant characters in the atom and residue names,
I knew about, and can handle, the 3 vs. 4 character residue names.
I haven't heard about the same for atom names. Is there a five
character atom name used somewhere? Or are you talking about
the difference between " CA " and "CA  "?
Since I am trying to be as complete as I can, can you tell me
about those types of variations?
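To make the atom name question concrete: the name occupies the four
columns 13-16, and as far as I know the usual convention starts a
one-letter element symbol in column 14, so the alpha carbon and a
calcium atom differ only in alignment. A small sketch (the helper
name atom_name_field and both sample lines are made up for this
example) shows why a column-level parser should hand the field back
unstripped:

```python
def atom_name_field(line):
    """Return columns 13-16 of an ATOM/HETATM line, spaces included."""
    return line.ljust(16)[12:16]


# Made-up sample lines, truncated after the residue sequence number.
alpha_carbon = "ATOM      1  CA  ALA A   1"
calcium = "HETATM    2 CA    CA A   2"

ca_name = atom_name_field(alpha_carbon)
calcium_name = atom_name_field(calcium)
# The two fields differ as sliced but collapse to the same text
# once the surrounding spaces are stripped.
same_when_stripped = ca_name.strip() == calcium_name.strip()
```

Interpreting the alignment is a semantic question and belongs to a
higher layer; the sketch only keeps the information available to it.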
Andrew Dalke
dalke@bioreason.com