PyMCS

Andrew Dalke

June 10, 1999

Andrew Dalke

June 10, 1999

Bioreason, Inc.

150 Washington Ave., Suite 303

Santa Fe, NM 87501

http://www.bioreason.com/

**Description**

MCS is a Python package to find the maximum common substructure between two compounds or, more generally, find all maximal common substructures between two graphs with colored nodes (``atoms'') and edges (``bonds''). For a test set of similar compounds of size roughly 25 atoms, the average pairwise MCS compute time is about 10 seconds.

This document contains architecture and design information along with the mathematical justification for the algorithm, and some initial results. The text was excerpted from our internal documentation and some parts have been edited to remove proprietary information.

The problem is how to find a maximum common substructure of two molecules. The details of the data structures and the justification for how the code works are described later. For now, here is the meat of the way things work.

Everything resolves around what I call a *fully bonded mapping*
between the two graphs. Simply put, this means that there is a
one-to-one and connected mapping between some atoms from the first
graph to the second, and if atoms are mapped to each other then they
must be ``equivalent.'' Also, if two atoms are bonded in the first
graph and their mapped atoms are bonded in the second graph, and the
two bonds are ``equivalent'' then there is a mapping between those two
atoms. The definition of equivalent is specified by the user.

The goal is to find the largest such mapping (others have different definitions of MCS).

The algorithm works by growing all possible maximal mappings from all
possible seeds. A *seed* is a simple mapping of one atom in the
first graph to one in the second.

A mapping is grown by looking at all the bonds between an atom in the
mapping to one outside (these are called *pending* bonds). Pick
one of the pending bonds in the first graph. Next find the set of all
possible pending bonds in the other graph which can be mapped to it.
Since one of the atoms are already mapped, this is pretty easy.

For each term in this set of possibles, add the mapping between the two pending bonds as well as their corresponding pending atoms. Sometimes the newly added atom had other edges which were pending. Since now both atoms are in the subgraph, it is easy to check if there is a bond in the second graph corresponding to it. Add all of these bonds to the mapping, and the result is a newly grown graph which is itself fully bonded. This means we can apply this algorithm recursively until it grows no farther. Recall that each possible mapping of pending bonds can be used to grow the mapping, so there can be branching at each recursion level.

If cycles are in the mapping (and there are few compounds with no
cycles) then this method will generate duplicate solutions. The
output can be made unique by keeping a list of all the previously
tried extensions from a given mapping as a list of *tainted*
edges. If one of them is used again, then since all possible
solutions using it have already been generated, the algorithm can
backtrack.

This algorithm will generate all of the maximal substructures, that is, the structures which cannot be expanded by another atom and still meet the fully bonded criterion. If the goal is to find a maximum common substructures then another cutoff can be gained by comparing the largest maximal structure found to date and, stopping if it is larger than any possible remaining substructure. (Eg, if the best mapping to date has 6 atoms and there are only 5 possible atoms to use for new seeds, then the code can stop searching.)

The MCS code manipulates a graph. A graph has a list of atoms and a
list of bonds. Each atom has a list of bonds connected to it, and
each bond has a list of the two atoms which it connects.
Additionally, if an atom or bond is referenced twice, the reference is
to the same object. Eg, the bonds from an atom in a graph are the
same objects listed as bonds in the graph. This is to distinguish
from systems (like the PyDaylight Molecule class) where the objects
are equal, but not the same object. In Python nomenclature, if `x
== y` then `x is y`.

The MCS data structures are Graph, GraphAtom and GraphBond and are
given in Graph.py. In order to build a Graph, the atom must have a
method which adds a new bond to the list. This is needed because of
the circular data structures involved - there is no way to build the
graph in one shot. Atoms and bonds have an attribute called `type` which store the user-supplied type information needed to compare
two atoms or bonds. This contents of the type field are not important
to the algorithm.

The MCS code uses two bond classifications for the MCS implementation. A bond can be *unused*, *used*, *pending*
or *forbidden*. It can also be *tainted* or *untainted*.
The first attribute is stored as the bond's `status` and the
second is the `tainted` flag.

The MCS algorithm only directly accesses the list of unused and
pending bonds for a given atom, so the status information is never
used directly and could be removed. (That value is used for assertion
checking so is still present in the code.) Instead, the list of `unused` and `pending` bonds are stored in the atom. The graph is
in charge of modifying these lists to reflect changes.

Bonds have a method called `xatom` which is given one of the atoms
in the bond and returns the other atom. If a bond is pending, then it
has the additional methods to return the pending atom (the atom which
isn't in the currently mapped subgraph) and to return the tuple of the
non-pending and pending atoms.

As mentioned, the graph is in charge of changing the list of unused
and pending bonds listed by an atom. They enforce through assertions
the evolution path of bonds through the MCS (eg, a bond can only
become used if previously it was pending, so `use_bond` checks
first that the bond is pending). The methods which directly change an
edge's status are `use_bond`, `forbid_bond` and `make_pending_bonds_forbidden`, and `pend_bond` and `make_unused_bonds_pending`. Also, a graph stores the full list of
used and pending bonds.

Another method, `taint_bond`, is used to set the bond's tainted
flag.

You'll notice in this list that there are no reverse methods, eg, to untaint an edge or to make a forbidden bond pending. This is because the algorithm does not need that functionality. However, there are steps where it expands the subgraph mapping and needs the ability to revert the graph to its original form (eg, during the recursive descent step). One implementation would duplicate the mapping information for each bond extension, but that overhead was deemed too large.

Instead, a graph is modified in-place and the previous state is
restored through a mark and free system. A graph contains a stack of
changes. When `mark` is called, the size of the stack is noted.
As edges are changed, the change information is pushed on the stack.
When `pop` is called, the changes which occurred since the previous
mark are used to revert the graph back to its original form.

Finally, the graph contains a method to find the edge between two given atoms, if it exists, and a destructor, which is needed to break the cycles in the data structure.

One other function is available in the Graph module, `graph_convert`. This converts data structures which have a
graph-like topology of atoms and bond into a proper Graph. The `type` attributes of GraphAtom and GraphBond become the originating
atoms and bonds, and it enforces that if `x == y` then `x is
y`. It is primarily used to convert the PyDaylight Molecule into
Graph form.

The Mapping class keeps track of the atom-to-atom and bond-to-bond
mappings. The methods `append_atoms` and `append_bonds` are
used to add a new correspondence between their two respective data
types. As with the Graph class, a mark and free system is used, so
mappings added since the last `mark` call are removed by calling
`pop`.

A list of all the mapped atom or bond pairs is returned with the `atoms` and `bonds` methods. The MCS code needs the ability to
find the 2nd atom or bond mapped by the first one. These are found
with `get_other_atom` and `get_other_bond`, which raise an
ValueError exception if the item isn't listed.

The Compare class is an abstract base class and as such doesn't need to be declared in Python - what's important is an object which implements the Compare ``protocol.'' This is used to compare type information between atoms and bonds and by using it we move comparison code out of the main MCS routine and into the caller. Thus, the user can tweak how the graph items should be compared, say by including test for aromaticity, or bond type, or valence, or atom type, etc.

An object which implements the Compare protocol must have four methods
and store two attributes. The constructor is passed the two Graphs
being compared. This is useful, eg, to precompute the results. The
graphs must be stored as `graph1` and `graph2`.

The methods `atoms` and `bonds` are used to compare two
GraphAtoms or GraphBonds. The atom or bond from the first graph is
given first. The result must be 0 to indicate that they are different
or 1 to indicate they are the same.

The method `matching_atoms` is passed an atom from the first
graph. It must return all the atoms in the second graph which could
potentially map to it.

I'll point out again that GraphAtoms and GraphBonds are used in the
call parameters. If specific type information is needed, the original
data can be referenced through the `type` attribute.

The Maximal class is also an abstract base class. It implements a
constructor and the methods `maximal` and `set_limit`.

The `maximal` method is called whenever the MCS algorithm
identifies a maximal common substructure. It takes the maximal
mapping as its sole parameter. This is a mutable object (it can
change in the future) so should be cloned if needed later. The method
can be used to determine if a ``good enough'' solution is found, to
store the best solution to date, to print out the results, etc.

If the solution is good enough, then it can raise the exception GoodEnough, which terminates the search. Usually this is called when the best MCS to date is larger than any possible future MCS, as when the number of search atoms is smaller than the mapping size.

The instance is initialized with the size of the largest possible
mapping, which is the size of the smaller of the two compounds. As
atoms are eliminated, the method `set_limit` is called with the
number of atoms remaining. If this new limit becomes smaller than the
best MCS so far, and only one maximum substructure is needed, then
GoodEnough should be raised.

A Maximal object must also store a mapping as the attribute `mapping`. This is what is returned by `find_mcs`. At present
the Maximal is created inside of `find_mcs`. In the future, the
Maximal object will be passed into the function, which gives more
end-user flexibility. When that becomes the case, this attribute
requirement will disappear.

This is the MCS implementation. It is given a Compare object and
returns the Mapping for the best MCS. The graphs to compare are given
by the `graph1` and `graph2` attributes of the Compare object.
For performance reasons, it is best to let graph1 be the smaller
graph.

This provide the conversion between a PyDaylight Molecule and a Graph.
At present the only end-user function is `find_mcs_for_pair`,
which takes two molecules and returns the Mapping class containing the
mapped pairs. The three parameters are the two molecules and an
option comparison class (note: not instance but class). The Graphs
for the two Molecules is created, then the comparison instance is used
by calling the comparison class with the two graphs, and the newly
converted data structures are passed to `mcs.find_mcs`.

Two Compare classes exist for Daylight. The first, CompareDaylight,
simply gets the underlying PyDaylight Atom or Bond and checks if the
their `symbol` or `bondorder` attributes are the same. This
requires a toolkit call for every comparison, so the second class,
CompareDaylightMatrix, precomputes all of the atom and bond matches
and store the results into an array. It also tags the GraphAtoms and
GraphBonds with the index offsets into the array, so that calling `atoms` or `bonds` becomes a simple lookup. There is a tradeoff
for doing this - if the molecules are small, the matrix construction
time dominates the problem, but for bigger molecules storing the
matrix is good. Overall, the matrix implementation is a bit over 6%
faster.

In this chapter I will formalize some of the justifications needed to prove that the algorithm works. In fact, I'll probably go overboard, but that's because it was so much fun.

Let's start with two molecules, represented as connected graphs *G*and *H*. Let *g*_{i} and *h*_{i} be the atoms in the two respective
molecules, and *e*_{i} and *f*_{i} be their bonds. Atoms and bonds are
typed (or colored) so we use the equivalence operator
to
indicate of two atoms or two bonds have the same type.

Consider connected subgraphs and . We define a mapping iff the mapping is one-to-one for the atoms and bonds, and the mapping preserves types. That is:

and

and

Also, for a bond to be mapped, the two atoms which comprise its ends must also be mapped.

The size of the mapping, |*M*|, is the number of atoms in *G* mapped
to *H*. With this definition, the number of bonds in the mapping is
not important, so a mapping of a pentane-line substructure (SMARTS:
CCCCC) with 5 atoms and 4 bonds has a size of 5 while the mapping of a
``tetrahedrane''-like substructure (SMARTS: C12C3C2C13) with 4 atoms
and 6 bonds has a size of 4.

With these definitions we can define a maximum common substructure
between *G* and *H* as the subgraphs
and
mapped
by a largest such mapping *M* (there could be several maximum common
substructures).

A few more definitions:

Let *M*_{1} and *M*_{2} be two mappings. Define
if
implies
.
Then
if
but
.

We say that a mapping *M*_{1} is *fully bonded* (XXX need better
word; too much like fully connected, which means something different
XXX) if there does not exist some
such that
|*M*_{1}| = |*M*_{2}|. In other words, all the possible atoms are used. For any
given *M*_{1} there is a unique *M*_{2} such that
and
|*M*_{1}| = |*M*_{2}|. (Since two atoms are bonded by at most one atoms and
because the mapping is one-to-one.)

Given a mapping
and a
bond
connecting *g*_{i} and *g*_{j}, we say that *e* is *used* by *M* if
and *e* is *unused* by *M* if
(by definition this means
).

Further, if *M* is fully bonded then we say *e* is *pending* on
*M* if one of
but the other is not. If *e* is
not used, not unused and not pending, then *e* is *forbidden*. By
construction, all bonds can be placed into one of these four
categories. The corresponding bond *f* in *H* can be categorized in
the same fashion.

Furthermore, if
and
|*M*_{1}|=|*M*_{2}| and *M*_{2} is fully
bonded, then *M*_{2} can be built from *M*_{1} by identifying which
pending bonds are now used or forbidden and which unused bonds are now
forbidden or pending.

Given some bond
we say that *e* is *totally forbidden*
in *M*_{1} if there does not exist any
where *e* is
used by *M*_{2}. It is straightforward to show that if *e* is forbidden
by some mapping then it is also absolutely forbidden (proof by
contradiction and knowing the mapping is fully bonded). Also, by
definition of an MCS, if *M* is an MCS then all bonds in *G* can be
labeled used or absolutely forbidden by *M*.

Consider some
where *e* is pending and *M* is fully bonded.
If is it not possible to add *e* and its *pending atom* *g*_{j} to
*M* to create a new, valid mapping, then *e* is also absolutely
forbidden. (Note that *e* and *g*_{j} can be mapped to several possible
*f* and *h*_{j}.) More specifically, if no such *e* exists for some *M*,
then *M* describes a *maximal common substructure* (though not
necessarily the *maximum* common substructure).

It is easy to show that for any two totally bonded mappings
there is at least one *M*_{k} such that
and
|*M*_{i}|+1=|*M*_{k}|. Additionally, there is some
*e* in *G* which was pending in *M*_{i} but is used in *M*_{k}.

Combining these last two paragraphs, then for any totally bonded *M*_{0}there exists a progression of totally bonded mappings
where
|*M*_{i}|+1=|*M*_{j}| for *i*+1=*j* and *M*_{n}is maximal.

Recall that for each *M*_{i} (*i*>0) there is a
which was
pending in *M*_{i-1} but used in *M*_{i}. Because for each mapping
there is a unique totally bonded mapping of the same size, we have a
method by which to construct the mapping chain given just *M*_{0} and
the ordered bonds *e*_{i}: for a given
,
create the
(not totally bonded) mapping
which includes the once
pending bond *e*_{i} and its pending atom. From
,
*M*_{i} is
the fully bonded mapping of the same size.

Therefore, all maximal common substructure can be generated from
starting with each unique, smallest fully bonded *M*_{0} (called a *seed*) and trying all possible combinations of pending bonds with
their corresponding bonds in *H*. At least one of these maximal
substructures will be a maximum common substructure.

The generation code described will not produce unique maximal
substructures, but a minor change to the algorithm remedies that.
Consider any fully bonded mapping *M*_{i} which doesn't describe a
maximal substructure. Let *a* and *b* be two possible bonds which can
be used to expand *M*_{i} by another atom. I need to show these will
always generate unique different substructures.

First generate all substructures expanding *M*_{i} by *a*. By the
inductive assumption, these will all be unique. Then mark bond *a*with the additional property that it is *tainted*. (A more
generic word for tainted as used in graph traversals is *visited*.) Now generate all the substructures expanding *M*_{i} by *b*except that if *a*, or more generally, any tainted bond, is included
in the mapping then the search branch can be terminated. This can be
done because all mappings using *a* from *M*_{i} have already been
generated. The resulting substructures are now all unique.

As another possibility, instead of tainting an edge, make it be forbidden. This will still generate common substructures, but the results won't fully bonded, and some of the subgraphs may be subset of others. This is best seen with an example. Consider a triangular substructure with three atoms and three bonds. When two atoms are in a fully bonded subgraph, then two edges are pending. Pick one edge and find all of its maximal structures. Now mark it as forbidden, which effectively removes it from the graph. The other edge is still pending, so it can be traversed. This produces a new subgraph, though with 3 atoms and 2 bonds. But the subgraph is not fully bonded in the original graph and it is a subset of the triangle solutions already found.

Making the edges forbidden is useful for testing the tainted version. The maximum mapped size of either algorithm is the same, and forbidding edges is easier to implement than tainting because the forbid mechanisms needed to be present for the underlying algorithm to work. Since the forbid code generates more substructures than the taint version, the overall time will be slower on average. Our preliminary tests show it to be about 15% slower.

This pairwise implementation was tested a couple of different ways.
The original tests were to ensure that simple graphs are matched
correctly, as `CO` to `COC` and the more complicated test
case of `c1ccc2c(c1)ccc3ccccc32` to `c1ccc2cc3ccccc3cc2c1`.

The more complete tests used a set of 9 SMILES strings describing similar compounds. These are:

- COMPOUNDS OMITTED BECAUSE OF THEIR PROPRIETARY NATURE

The 45 possible combinations were generated using both the tainted and forbidden methods and the MCS sizes compared. This turned out to be a very good way to cross check. After, alas, entirely too many mistakes, I got them to give the same answers. The results of some of the comparisons were eyeball checked to see if they were correct, and they seem to be so.

By the way, the eyeball comparisons used the incomplete TkDepict widget, which is part of PyDaylight. It contains enough implementation to be able to label matching atoms by number and matching edges by letters.

A couple of different comparison functions were used and tested against each other (calling the Daylight toolkit directly or using a precomputed matrix). This was used to test the Compare interface orthogonality to different implementation, but it also found an implementation bug.

The cutoff code was tested by writing using several classes which implemented the Maximal protocol. This was used to check that if all maximal structures were generated, then the best result would still be no bigger than with an earlier cutoff. It was also used to double check that the taint implementation did indeed only find unique mappings (and showed that the forbidden one finds some non-fully bonded solutions).

The N-by-N comparisons generated information about the comparison times and the number of maximal substructures in the test set. The times were generated on val, a two-processor 250 MHz R10000 Octane. One of the processors was fully loaded, so the real times should be slightly smaller. Also, recall that the implementation is in unoptimized Python so expect about a factor of 5 speedup with a rewrite in C.

If every molecule is compared to every other molecule (including
itself) then the average MCS time was 11.7 seconds for the matrix
comparison and 12.5 for the direct toolkit comparison. If a molecule
wasn't compared to itself then the average times moved to 14.4 and
15.7 seconds, since the size cutoff is quickly found in a self-self
MCS. Hmmm, interesting. Using the optimize flag, `-O`, for
Python which gets rid of the assert statements scattered throughout
the code (which were extremely useful to track down bugs!) and the
line number op code in the Python byte code. The times went down to
10.1 and 11.3 seconds.

A scatter plot of time vs. size, where size is one of {the size of the smaller compound, the larger one, or their product}, doesn't show anything significant. Probably need to try more compounds, and add a factor describing ``local symmetries'' like branches to identical atoms, which can conflate the times.

The worst time was 87 seconds for the comparison of these two:

- COMPOUNDS REMOVED BECAUSE OF THEIR PROPRIETARY NATURE

I tried to generate a list of all the maximal substructures (including self-self comparisons). This was not fully successful because of these time needed to compare these two aforementioned structures. It took 55.5 hours to finish (I let it run to completion to ensure that it wasn't stuck in a loop somewhere). Of the results so far, there are about 300 maximal structures on average, with the most being about 700 structures. The average comparison time is 95 seconds.

This document was generated using the
**LaTeX**2`HTML` translator Version 98.1p1 release (March 2nd, 1998)

Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:

**latex2html** `-split 2 mcs`.

The translation was initiated by on 1999-06-10

- Contents
- Algorithm
- Interfaces
- Justifications
- Validation and Comparison Results
- About this document ...