Bug/feature of findContacts

Andrew Dalke dalke@ks.uiuc.edu
Fri, 14 Aug 1998 23:32:15 -0500


(Before I go on, perhaps I should explain myself somewhat.  I've been
working on basic code structure for small molecules and have got
somewhat wrapped up in architecture issues.  Konrad's reply hit almost
perfectly with what I've been doing so I decided to air out some of my
thoughts on the subject.)


Konrad said:
> Philosophy: Atom indices are an implementation detail introduced for
> the convenience of the C routines. User code should not use them,
> nor ever have to use them, for anything at all. In an
> object-oriented system, every object has a well-defined identity,
> and additional identity schemes only add confusion.

I fully agree with this.  I recently did some translation of code
written with a connection table in mind where to get the list of
neighbors to an atom you had:
  the atom's list of bond ids, as integer
  references in a bond list
  where bonds had an integer list of
  references to an atom list
making the code look like:

  for bond_id in atom.bonds:
    bond = bond_table[bond_id]
    for atom_id in bond.atoms:
      if atom_id <> atom:
        atoms.append(atom_table[atom_id])

whereas a more python-like/OO way was

  for bond in atom.bonds:
    atoms.append(bond.xatom(atom))

(or if you like use a map :)

I managed to cut the code size down by quite a bit and, if done in a
strongly typed language like C++, you can catch a lot of errors at
compile time that crop up when using indicies.  (I remember some of
the ones that arose in VMD before we started using pointers more :)


  Another things I've been doing, now that I've changed jobs and have
started doing small molecule chemistry, is building a Python wrapper
on top of a SWIG translation of the Daylight Toolkit C API.  They have
not indexable numbering system for neither the atoms nor bonds, and
neither does any of my wrapper code, except for the implicit position
in the array.

  This has made many things easy.  For example, Daylight has a concept
of a unique ordering of atoms in a molecule (based on the atoms and
bonds) but our input data is not in that order and we need to convert
between the two.  Since everything is referenced directly, rearranging
the atoms in the list is easy, where as rearranging an index table is
pretty complicated.


> At the C level, each atom is identified by an index into the arrays
> that store the atom-related quantities (position, velocity, charge,
> etc.).

This is the point where I wish I could chide Konrad for still using
indicies :) For our simulation code, most of the information is stored
in an atom or bond class, and VMD stores nearly all of its information
that way.  However, we keep coordinates, velocities and forces in a
different array and use indicies to map between then, since
performance is a lot better in that case.  (And VMD stores coordinate
information that way.)  (We in this case is UIUC and not my current
company.)



> Implementation: Before a C-level routine is called, the
> configuration() method for the universe object must be called to
> ensure the assignment of indices and the construction of the
> configuration array.
> ...
> The index assignment and configuration array construction could
> happen during the construction of the universe, and this would have
> some advantages. But the cost is significant, both in CPU time and
> in memory fragmentation (for each atom added, a new configuration
> array would have to be allocated). Bookkeeping would also get more
> complicated. All that means that I'd rather stick to the current
> method!

So far, other than for the above performance reasons, I have come
across very few cases where I've needed to express relationships as
indicies.  When I've needed them, they've been easy to build for that
existing structure, but I agree that trying to manipulate those tables
during structure modification is tedious.  There's also the question
of how much overhead that costs if you need to do it for every
modification and most of the times it isn't needed.

Besides, there are other ways, like caching information and toggling a
status flag to indicate if a rebuild is needed.  There's still some
overhead but not much and it is much easier to understand.


> Finally, a comment about the PDB indices: when a molecule is created
> from a PDB file, the PDB index is stored for each atom. When the
> final atom indices are assigned, MMTK tries to keep the PDB indices
> whenever possible. 

All this talk of indicies goes double (or triple) for PDB files.
There are at least three ways to index an atom:

  1) the "serial number", which need not be consequtive because for
some reason TER records share the same number space, as in
ATOM   1
TER    2
ATOM   3

  2) the position in the file (the first being numbered 0 or 1 and
increasing by one thereafter

  3) the atom name + residue id + insertion code + chain ( +
variations like segment name and residue name)


						Andrew
						dalke@bioreason.com
P.S.
  The Daylight wrapper package should be released within the next
couple of weeks.  The biggest problem is I've no web server for it.  I
want to get on the Python Starship, but the pages says they aren't
taking new people for several weeks.
  In addition to the package I've been working on a few other
routines, like a SMILES parser and aromatic ring detection.  Is anyone 
on this list interested in working with some of it?