UPDB

Andrew Dalke dalke@bioreason.com
Fri, 28 Aug 1998 14:25:37 -0700


Malcolm Gillies <gillies@cmcind.far.ruu.nl> wrote:
> What did you have in mind with regard to support of the various
> mutations of the PDB format which are read and written by various
> modelling programs? Most of the misery I've had to deal with has
> been a result of the quirks of the PDB handling in programs such
> as Insight, Sybyl, DelPhi, GRASP etc.

Can you give me some information about the syntax differences used
by those programs?  Not semantics (like how atom names are done)
because I don't attempt to address that at my level of code.  I'm
only worried about a column delimited file format with a small set
of fixed data types.

Here's how I can support different formats.  The parser generator
takes an input file that describes the format.  Each program should
have its own file of definitions. For example, here is the
old-style vs XPLOR extended old-style, ATOM records (sans entry
and line number information):

 =========================
This is the original (well, at least 1992) PDB definition for an ATOM
record.  It has neither the element name nor the charge.  The insertion
code can be any character, unlike the AChar it is now.

COLUMNS        DATA TYPE       FIELD         DEFINITION
------------------------------------------------------------
 1 - 6         Record name     "ATOM  "      (obsolete)
 7 - 11        Integer         serial        Atom serial number.
13 - 16        Atom            name          Atom name.
17             Character       altLoc        Alternate location indicator.
18 - 20        Residue name    resName       Residue name.
22             Character       chainID       Chain identifier.
23 - 26        Integer         resSeq        Residue sequence number.
27             Character       iCode         Insertion code.
31 - 38        Real(8.3)       x             Orthogonal coordinates
                                             for X in Angstroms.
39 - 46        Real(8.3)       y             Orthogonal coordinates
                                             for Y in Angstroms.
47 - 54        Real(8.3)       z             Orthogonal coordinates
                                             for Z in Angstroms.
55 - 60        Real(6.2)       occupancy     Occupancy.
61 - 66        Real(6.2)       tempFactor    Temperature factor.



This supports the "segment" extension of XPLOR as it was used before
the PDB adopted it, so it differs from the current PDB definition in
the character insertion code and the missing charge field.

COLUMNS        DATA TYPE       FIELD         DEFINITION
------------------------------------------------------------
 1 - 6         Record name     "ATOM  "      (obsolete) with segID
 7 - 11        Integer         serial        Atom serial number.
13 - 16        Atom            name          Atom name.
17             Character       altLoc        Alternate location indicator.
18 - 20        Residue name    resName       Residue name.
22             Character       chainID       Chain identifier.
23 - 26        Integer         resSeq        Residue sequence number.
27             Character       iCode         Insertion code.
31 - 38        Real(8.3)       x             Orthogonal coordinates
                                             for X in Angstroms.
39 - 46        Real(8.3)       y             Orthogonal coordinates
                                             for Y in Angstroms.
47 - 54        Real(8.3)       z             Orthogonal coordinates
                                             for Z in Angstroms.
55 - 60        Real(6.2)       occupancy     Occupancy.
61 - 66        Real(6.2)       tempFactor    Temperature factor.
73 - 76        LString(4)      segID         Segment identifier,
                                             left-justified.

  ======================================
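
To make the idea concrete, here is a minimal sketch of how the two
tables above could drive a parser.  It is not the actual parser
generator: the names OLD_ATOM, XPLOR_ATOM and make_unpacker are made up
for illustration, the tables are typed in by hand rather than read from
a definition file, and it is written for current Python.

```python
# Hypothetical field tables: (field name, first column, last column, converter).
# Columns are 1-based and inclusive, exactly as in the definitions above.
OLD_ATOM = [
    ("serial",      7, 11, int),
    ("name",       13, 16, str),    # spacing kept: " CA " differs from "CA  "
    ("altLoc",     17, 17, str),
    ("resName",    18, 20, str),
    ("chainID",    22, 22, str),
    ("resSeq",     23, 26, int),
    ("iCode",      27, 27, str),
    ("x",          31, 38, float),
    ("y",          39, 46, float),
    ("z",          47, 54, float),
    ("occupancy",  55, 60, float),
    ("tempFactor", 61, 66, float),
]

# The XPLOR-style record is the old one plus the segment identifier.
XPLOR_ATOM = OLD_ATOM + [("segID", 73, 76, str)]


def make_unpacker(fields):
    """Build an unpack function for one record definition."""
    def unpack(line):
        line = line.ljust(80)           # short lines are padded with blanks
        record = {}
        for name, first, last, convert in fields:
            # convert 1-based inclusive columns to a Python slice
            record[name] = convert(line[first - 1:last])
        return record
    return unpack


unpack_old_ATOM = make_unpacker(OLD_ATOM)
unpack_xplor_ATOM = make_unpacker(XPLOR_ATOM)

# Build a sample line from a format string so the columns are certain.
example_line = ("ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f"
                "      %-4s") % (1, " N  ", " ", "MET", "A", 1, " ",
                                 27.340, 24.430, 2.614, 1.00, 9.67, "SEG1")
record = unpack_xplor_ATOM(example_line)
```

Supporting another program's dialect then means adding one more table,
not writing another parser.  A blank integer or real field makes this
sketch raise ValueError; a real parser would have to decide what a
blank field means.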


These create different parser classes, and I have a master class
that figures out which method to call.  For example, here's
code for determining which version of the HELIX definition
to use.  The old definition does not have the helix length in
columns 72-76.  Since the format should be consistent throughout
the file, I only have to check the type once:

    def __init__(self):
        self.version2_1 = Version2_1.Version2_1()
        self.old = OldVersion.OldVersion()

        # these are ones where we need to figure out which version to use
        self.unpack_map['HELIX '] = self.resolve_HELIX 
        self.unpack_map['HELIX'] = self.resolve_HELIX
        ...
        # This has been dropped in the newer version
        self.unpack_map['FTNOTE'] = self.old.unpack_FTNOTE

    def resolve_HELIX(self, line):
        s = line[71:76]
        try:
            string.atoi(s)
        except ValueError:
            # old-style
            self.unpack_map['HELIX '] = self.old.unpack_HELIX
            self.unpack_map['HELIX'] = self.old.unpack_HELIX
            return self.unpack_map['HELIX'](line)
        # new version
        self.unpack_map['HELIX '] = self.version2_1.unpack_HELIX
        self.unpack_map['HELIX'] = self.version2_1.unpack_HELIX
        return self.unpack_map['HELIX'](line)

    def unpack(self, line):
        typ = line[:6]
        parser = self.unpack_map[typ]
        # have something, so work with it
        x = parser(line)
        # store the type information in the dictionary
        # Should the individual parser add the type info instead?
        # Probably.  This feels too hackish.
        x['type'] = typ
        return x

So the first time you unpack a 'COMPND', the unpack_map is used to
call the method 'resolve_COMPND'.  That checks the fields to
determine which record version it is, replaces the unpack_map entry
with the right parser, and forwards the call.  (I really like Python!)
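
The excerpt above depends on modules not shown here, so the following
is a self-contained sketch of the same resolve-once dispatch.
StubParser is a made-up stand-in for the generated parser classes, the
class name MasterParser is an assumption, and the code is written for
current Python (int() in place of string.atoi).

```python
class StubParser:
    """Hypothetical stand-in for a generated parser class."""

    def __init__(self, label):
        self.label = label

    def unpack_HELIX(self, line):
        # A real parser would return the parsed fields; this only
        # reports which version handled the line.
        return {"version": self.label}


class MasterParser:
    def __init__(self):
        self.old = StubParser("old")
        self.version2_1 = StubParser("2.1")
        # Until the first HELIX line is seen, dispatch goes to the resolver.
        self.unpack_map = {"HELIX ": self.resolve_HELIX}

    def resolve_HELIX(self, line):
        # line[71:76] is columns 72-76, the helix length in the newer format.
        try:
            int(line[71:76])
            chosen = self.version2_1.unpack_HELIX
        except ValueError:
            # blank or missing field: old-style record
            chosen = self.old.unpack_HELIX
        # Replace the map entry so later HELIX lines skip this check.
        self.unpack_map["HELIX "] = chosen
        return chosen(line)

    def unpack(self, line):
        typ = line[:6]
        record = self.unpack_map[typ](line)
        record["type"] = typ
        return record


parser = MasterParser()
first = parser.unpack("HELIX ".ljust(71) + "   12")
```

After that first call the map entry points straight at the version 2.1
method, so the version check costs nothing for the rest of the file.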

  All the logic is hand written, as I could figure out no good way to
automate that (e.g., given all of the record definitions, create a
master parser to choose between them).  That's really hard, and
I don't think there are enough cases to warrant the effort.

> number of significant characters in the atom and residue names,

I knew about, and can handle, the 3 vs. 4 character residue names.
I haven't heard about the same for atom names.  Is there a five
character atom name used somewhere?  Or are you talking about
the difference between " CA " and "CA  "?

  Since I am trying to be as complete as I can, can you tell me
about those types of variations?

						Andrew Dalke
						dalke@bioreason.com