Andrew Dalke dalke@bioreason.com
Thu, 03 Sep 1998 03:23:26 -0700


  I've put my first public release of UPDB, a PDB parser
generator program, at

This is an interim location until I can get an account
on starship.

  It works -- it parses all of the PDB files from the "aa"
subdirectory (my test directory) and when it reassembles from
a dictionary to text the output is the same as the input.
My spot checks don't bring up any problems.

  There will be some changes in layout.  Currently you
need to say "from UPDB.Python.Parser import Parser" for
the master ("attempts to parse all PDB files") parser.
I want to get rid of the extra "Python" level.  I also
want to push the format files to their own directory.

  There's a circular reference in how I build the class,
caused by the class storing a map from the record type
to the method used to pack/unpack the record.  I'm not
sure of the best way to eliminate the reference, so
at present you must call "destroy" to remove the circularity.
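
To make the cycle concrete, here is a minimal sketch.  The class
and method names are made up for illustration and are not the
actual UPDB code.  The first class stores bound methods in a
per-instance map (instance -> map -> bound method -> instance)
and needs "destroy"; the second is one possible way around it,
storing method names in a class-level table and looking them up
on demand.

```python
class CyclicParser:
    """Hypothetical parser whose instance map holds bound methods."""

    def __init__(self):
        # Each bound method refers back to self, so this creates
        # the cycle: self -> self.unpackers -> bound method -> self
        self.unpackers = {"ATOM": self.unpack_atom}

    def unpack_atom(self, line):
        return {"name": line.split()[1]}

    def unpack(self, line):
        record_type = line[:6].strip()
        return self.unpackers[record_type](line)

    def destroy(self):
        # Break the cycle by dropping the bound methods.
        self.unpackers.clear()


class NameTableParser:
    """Hypothetical alternative: the table holds names, not methods."""

    # Shared by all instances; holds plain strings, so no instance
    # ever refers back to itself through this table.
    unpackers = {"ATOM": "unpack_atom"}

    def unpack_atom(self, line):
        return {"name": line.split()[1]}

    def unpack(self, line):
        record_type = line[:6].strip()
        # The bound method is created here and dropped after the call.
        method = getattr(self, self.unpackers[record_type])
        return method(line)
```

The name lookup adds a getattr call per record, so this may trade
some speed for not having to call "destroy".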

  The Python parser is pretty slow and I would welcome
suggestions for speed improvements.

  The distribution includes formats for PDB format
version 2.1, version 1 ("old-style") and methods for
adding XPLOR extensions.  I've also included untested
formats for the Raster3D "COLOUR" card and the UCSF
scene description cards.

  I've pre-built Version 2.1 and Version 1 parsers,
both with XPLOR extensions that are backwards compatible.
I would also like to have an "only reads ATOM and HETATM"
parser which can be faster by skipping all other records.
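
One way such a parser might get its speed, sketched here as an
assumption and not as part of UPDB: look only at the record name
in the first six columns and pass just the matching lines on to
the full unpack code.  The function name and the sample lines
are made up.

```python
# Record names padded to the six-column record name field.
COORDINATE_RECORDS = ("ATOM  ", "HETATM")


def coordinate_lines(lines):
    """Yield only the ATOM and HETATM lines, skipping everything else."""
    for line in lines:
        # str.startswith accepts a tuple of prefixes.
        if line.startswith(COORDINATE_RECORDS):
            yield line


# Shortened, made-up lines; real records are fixed-width and longer.
sample = [
    "HEADER    EXAMPLE",
    "ATOM      1  CA  ALA A   1",
    "HETATM    2  O   HOH A   2",
    "END",
]
kept = list(coordinate_lines(sample))
```

Here "kept" holds only the two coordinate lines; the HEADER and
END lines are never handed to an unpack function.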

  There is an ugly problem with "subtypes", that is,
records with the same name which differ only in their
identifiers. My implementation takes the line and converts
it to a dictionary:

(not real ATOM format)
ATOM   CA  1.2 3.4 5.6
   \  /

{ "name" : "CA", "x": 1.2, "y": 3.4, "z": 5.6 }

With this method the opposite direction poses a problem.
I have a dictionary with the keys "name", "x", "y" and "z".
If there are no other records with the same field name
signature, then it is simple to determine the pack function.
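
A minimal sketch of that lookup, using the made-up ATOM layout
from above (not the real ATOM format).  The function names and
the PACKERS table are hypothetical; the point is only that the
set of field names can serve as the key for finding the pack
function.

```python
def unpack_atom(line):
    """Turn the toy ATOM line into a dictionary."""
    fields = line.split()
    return {
        "name": fields[1],
        "x": float(fields[2]),
        "y": float(fields[3]),
        "z": float(fields[4]),
    }


def pack_atom(record):
    """Turn the dictionary back into the toy ATOM line."""
    return "ATOM   %-3s %.1f %.1f %.1f" % (
        record["name"], record["x"], record["y"], record["z"])


# Map from field name signature to pack function.  This only works
# while every signature is unique.
PACKERS = {
    frozenset(["name", "x", "y", "z"]): pack_atom,
}


def pack(record):
    """Pick the pack function from the record's field names."""
    signature = frozenset(record)  # iterating a dict yields its keys
    return PACKERS[signature](record)


line = "ATOM   CA  1.2 3.4 5.6"
round_trip = pack(unpack_atom(line))
```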

Alas, it isn't that easy.  For example, the ENDMDL and END
records have no data, so they are both represented by an
empty dictionary.  A nearly complete solution is to add the
record name as a "type" in the dictionary, but there are
3 pairs of records with type signature problems anyway.
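
The END/ENDMDL collision, and the effect of adding a "type"
key, can be sketched as follows.  The FORMATS table and both
helper functions are made up for illustration.  This does not
address the remaining pairs, where the record name itself is
the same.

```python
# Hypothetical field lists; only the names matter here.
FORMATS = {
    "ATOM": ("name", "x", "y", "z"),
    "END": (),
    "ENDMDL": (),
}


def find_conflicts(formats):
    """Return groups of record names that share a field signature."""
    by_signature = {}
    for record_name, fields in formats.items():
        by_signature.setdefault(frozenset(fields), []).append(record_name)
    return sorted(
        sorted(names) for names in by_signature.values() if len(names) > 1)


def add_type(record_name, record):
    """Return a copy of the record tagged with its record name."""
    tagged = dict(record)
    tagged["type"] = record_name
    return tagged


conflicts = find_conflicts(FORMATS)
end = add_type("END", {})
endmdl = add_type("ENDMDL", {})
```

Without the tag both records are the same empty dictionary; with
it, "end" and "endmdl" differ and the pack function can be chosen
from the "type" value.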


My current solution is to find the first non-identical
string identifiers for the conflicting records and use that
name as a subtype.  This is an ad hoc solution and doesn't
work if there are more than two records with the same
signature (which the UCSF extensions have, unless I change some
field names).  I would rather each subtype of a given
record be assigned a unique name, with that name used as the
subtype identifier.  I cannot think of any way to do that
automatically (in such a way that the subtype name would be
obvious from the documentation), so all I can think of is
mandating those names.  And I would rather the PDB specified
those names.

Until then I've decided to just change the variable names and
not worry about a better solution yet.  I don't like that
solution, as the downside is that I have to know the actual
name of the field for every record, like textX, eyeX, lookX
instead of just "x".

  Finally, I'm on vacation starting Friday morning for a bit
over a week, so I won't be able to answer email after today
until I get back.