PyMCS
Andrew Dalke
June 10, 1999

Bioreason, Inc.
150 Washington Ave., Suite 303
Santa Fe, NM 87501
http://www.bioreason.com/




Description

MCS is a Python package to find the maximum common substructure between two compounds or, more generally, find all maximal common substructures between two graphs with colored nodes (``atoms'') and edges (``bonds''). For a test set of similar compounds of size roughly 25 atoms, the average pairwise MCS compute time is about 10 seconds.

This document contains architecture and design information along with the mathematical justification for the algorithm, and some initial results. The text was excerpted from our internal documentation and some parts have been edited to remove proprietary information.


Contents

Algorithm

The problem is how to find a maximum common substructure of two molecules. The details of the data structures and the justification for how the code works are described later. For now, here is the meat of the way things work.

Everything resolves around what I call a fully bonded mapping between the two graphs. Simply put, this means that there is a one-to-one and connected mapping between some atoms from the first graph to the second, and if atoms are mapped to each other then they must be ``equivalent.'' Also, if two atoms are bonded in the first graph and their mapped atoms are bonded in the second graph, and the two bonds are ``equivalent'' then there is a mapping between those two atoms. The definition of equivalent is specified by the user.

The goal is to find the largest such mapping (others have different definitions of MCS).

The algorithm works by growing all possible maximal mappings from all possible seeds. A seed is a simple mapping of one atom in the first graph to one in the second.

A mapping is grown by looking at all the bonds between an atom in the mapping to one outside (these are called pending bonds). Pick one of the pending bonds in the first graph. Next find the set of all possible pending bonds in the other graph which can be mapped to it. Since one of the atoms are already mapped, this is pretty easy.

For each term in this set of possibles, add the mapping between the two pending bonds as well as their corresponding pending atoms. Sometimes the newly added atom had other edges which were pending. Since now both atoms are in the subgraph, it is easy to check if there is a bond in the second graph corresponding to it. Add all of these bonds to the mapping, and the result is a newly grown graph which is itself fully bonded. This means we can apply this algorithm recursively until it grows no farther. Recall that each possible mapping of pending bonds can be used to grow the mapping, so there can be branching at each recursion level.

If cycles are in the mapping (and there are few compounds with no cycles) then this method will generate duplicate solutions. The output can be made unique by keeping a list of all the previously tried extensions from a given mapping as a list of tainted edges. If one of them is used again, then since all possible solutions using it have already been generated, the algorithm can backtrack.

This algorithm will generate all of the maximal substructures, that is, the structures which cannot be expanded by another atom and still meet the fully bonded criterion. If the goal is to find a maximum common substructures then another cutoff can be gained by comparing the largest maximal structure found to date and, stopping if it is larger than any possible remaining substructure. (Eg, if the best mapping to date has 6 atoms and there are only 5 possible atoms to use for new seeds, then the code can stop searching.)

Interfaces

Graph Module

The MCS code manipulates a graph. A graph has a list of atoms and a list of bonds. Each atom has a list of bonds connected to it, and each bond has a list of the two atoms which it connects. Additionally, if an atom or bond is referenced twice, the reference is to the same object. Eg, the bonds from an atom in a graph are the same objects listed as bonds in the graph. This is to distinguish from systems (like the PyDaylight Molecule class) where the objects are equal, but not the same object. In Python nomenclature, if x == y then x is y.

The MCS data structures are Graph, GraphAtom and GraphBond and are given in Graph.py. In order to build a Graph, the atom must have a method which adds a new bond to the list. This is needed because of the circular data structures involved - there is no way to build the graph in one shot. Atoms and bonds have an attribute called type which store the user-supplied type information needed to compare two atoms or bonds. This contents of the type field are not important to the algorithm.

The MCS code uses two bond classifications for the MCS implementation. A bond can be unused, used, pending or forbidden. It can also be tainted or untainted. The first attribute is stored as the bond's status and the second is the tainted flag.

The MCS algorithm only directly accesses the list of unused and pending bonds for a given atom, so the status information is never used directly and could be removed. (That value is used for assertion checking so is still present in the code.) Instead, the list of unused and pending bonds are stored in the atom. The graph is in charge of modifying these lists to reflect changes.

Bonds have a method called xatom which is given one of the atoms in the bond and returns the other atom. If a bond is pending, then it has the additional methods to return the pending atom (the atom which isn't in the currently mapped subgraph) and to return the tuple of the non-pending and pending atoms.

As mentioned, the graph is in charge of changing the list of unused and pending bonds listed by an atom. They enforce through assertions the evolution path of bonds through the MCS (eg, a bond can only become used if previously it was pending, so use_bond checks first that the bond is pending). The methods which directly change an edge's status are use_bond, forbid_bond and make_pending_bonds_forbidden, and pend_bond and make_unused_bonds_pending. Also, a graph stores the full list of used and pending bonds.

Another method, taint_bond, is used to set the bond's tainted flag.

You'll notice in this list that there are no reverse methods, eg, to untaint an edge or to make a forbidden bond pending. This is because the algorithm does not need that functionality. However, there are steps where it expands the subgraph mapping and needs the ability to revert the graph to its original form (eg, during the recursive descent step). One implementation would duplicate the mapping information for each bond extension, but that overhead was deemed too large.

Instead, a graph is modified in-place and the previous state is restored through a mark and free system. A graph contains a stack of changes. When mark is called, the size of the stack is noted. As edges are changed, the change information is pushed on the stack. When pop is called, the changes which occurred since the previous mark are used to revert the graph back to its original form.

Finally, the graph contains a method to find the edge between two given atoms, if it exists, and a destructor, which is needed to break the cycles in the data structure.

One other function is available in the Graph module, graph_convert. This converts data structures which have a graph-like topology of atoms and bond into a proper Graph. The type attributes of GraphAtom and GraphBond become the originating atoms and bonds, and it enforces that if x == y then x is y. It is primarily used to convert the PyDaylight Molecule into Graph form.

Mapping Module

The Mapping class keeps track of the atom-to-atom and bond-to-bond mappings. The methods append_atoms and append_bonds are used to add a new correspondence between their two respective data types. As with the Graph class, a mark and free system is used, so mappings added since the last mark call are removed by calling pop.

A list of all the mapped atom or bond pairs is returned with the atoms and bonds methods. The MCS code needs the ability to find the 2nd atom or bond mapped by the first one. These are found with get_other_atom and get_other_bond, which raise an ValueError exception if the item isn't listed.

Compare protocol

The Compare class is an abstract base class and as such doesn't need to be declared in Python - what's important is an object which implements the Compare ``protocol.'' This is used to compare type information between atoms and bonds and by using it we move comparison code out of the main MCS routine and into the caller. Thus, the user can tweak how the graph items should be compared, say by including test for aromaticity, or bond type, or valence, or atom type, etc.

An object which implements the Compare protocol must have four methods and store two attributes. The constructor is passed the two Graphs being compared. This is useful, eg, to precompute the results. The graphs must be stored as graph1 and graph2.

The methods atoms and bonds are used to compare two GraphAtoms or GraphBonds. The atom or bond from the first graph is given first. The result must be 0 to indicate that they are different or 1 to indicate they are the same.

The method matching_atoms is passed an atom from the first graph. It must return all the atoms in the second graph which could potentially map to it.

I'll point out again that GraphAtoms and GraphBonds are used in the call parameters. If specific type information is needed, the original data can be referenced through the type attribute.

mcs Module

Maximal protocol

The Maximal class is also an abstract base class. It implements a constructor and the methods maximal and set_limit.

The maximal method is called whenever the MCS algorithm identifies a maximal common substructure. It takes the maximal mapping as its sole parameter. This is a mutable object (it can change in the future) so should be cloned if needed later. The method can be used to determine if a ``good enough'' solution is found, to store the best solution to date, to print out the results, etc.

If the solution is good enough, then it can raise the exception GoodEnough, which terminates the search. Usually this is called when the best MCS to date is larger than any possible future MCS, as when the number of search atoms is smaller than the mapping size.

The instance is initialized with the size of the largest possible mapping, which is the size of the smaller of the two compounds. As atoms are eliminated, the method set_limit is called with the number of atoms remaining. If this new limit becomes smaller than the best MCS so far, and only one maximum substructure is needed, then GoodEnough should be raised.

A Maximal object must also store a mapping as the attribute mapping. This is what is returned by find_mcs. At present the Maximal is created inside of find_mcs. In the future, the Maximal object will be passed into the function, which gives more end-user flexibility. When that becomes the case, this attribute requirement will disappear.

find_mcs

This is the MCS implementation. It is given a Compare object and returns the Mapping for the best MCS. The graphs to compare are given by the graph1 and graph2 attributes of the Compare object. For performance reasons, it is best to let graph1 be the smaller graph.

Daylight Module

This provide the conversion between a PyDaylight Molecule and a Graph. At present the only end-user function is find_mcs_for_pair, which takes two molecules and returns the Mapping class containing the mapped pairs. The three parameters are the two molecules and an option comparison class (note: not instance but class). The Graphs for the two Molecules is created, then the comparison instance is used by calling the comparison class with the two graphs, and the newly converted data structures are passed to mcs.find_mcs.

Two Compare classes exist for Daylight. The first, CompareDaylight, simply gets the underlying PyDaylight Atom or Bond and checks if the their symbol or bondorder attributes are the same. This requires a toolkit call for every comparison, so the second class, CompareDaylightMatrix, precomputes all of the atom and bond matches and store the results into an array. It also tags the GraphAtoms and GraphBonds with the index offsets into the array, so that calling atoms or bonds becomes a simple lookup. There is a tradeoff for doing this - if the molecules are small, the matrix construction time dominates the problem, but for bigger molecules storing the matrix is good. Overall, the matrix implementation is a bit over 6% faster.

Justifications

In this chapter I will formalize some of the justifications needed to prove that the algorithm works. In fact, I'll probably go overboard, but that's because it was so much fun.

Let's start with two molecules, represented as connected graphs Gand H. Let gi and hi be the atoms in the two respective molecules, and ei and fi be their bonds. Atoms and bonds are typed (or colored) so we use the equivalence operator $\equiv$ to indicate of two atoms or two bonds have the same type.

Consider connected subgraphs $G^\prime\subseteq G$ and $H^\prime\subseteq H$. We define a mapping $M:G^\prime\mapsto H^\prime$iff the mapping is one-to-one for the atoms and bonds, and the mapping preserves types. That is:

$M(g_i\in G^\prime) = h_i\in H^\prime$ and $g_i\equiv h_i$

$M(e_j\in G^\prime) = f_j\in H^\prime$ and $e_j\equiv f_j$

Also, for a bond to be mapped, the two atoms which comprise its ends must also be mapped.

The size of the mapping, |M|, is the number of atoms in G mapped to H. With this definition, the number of bonds in the mapping is not important, so a mapping of a pentane-line substructure (SMARTS: CCCCC) with 5 atoms and 4 bonds has a size of 5 while the mapping of a ``tetrahedrane''-like substructure (SMARTS: C12C3C2C13) with 4 atoms and 6 bonds has a size of 4.

With these definitions we can define a maximum common substructure between G and H as the subgraphs $G^\prime$ and $H^\prime$ mapped by a largest such mapping M (there could be several maximum common substructures).

A few more definitions:

Let M1 and M2 be two mappings. Define $M_1\subseteq M_2$ if $M_1(x\in G) = y$ implies $M_2(x\in G) = y$. Then $M_1\subset M_2$ if $M_1\subseteq M_2$ but $M_2\not\subseteq M_1$.

We say that a mapping M1 is fully bonded (XXX need better word; too much like fully connected, which means something different XXX) if there does not exist some $M_2\supset M_1$ such that |M1| = |M2|. In other words, all the possible atoms are used. For any given M1 there is a unique M2 such that $M_1\subseteq M_2$ and |M1| = |M2|. (Since two atoms are bonded by at most one atoms and because the mapping is one-to-one.)

Given a mapping $M(G^\prime\subseteq G)\mapsto H^\prime\subseteq H$ and a bond $e\in G$ connecting gi and gj, we say that e is used by M if $e\in G^\prime$ and e is unused by M if $g_i, g_j\not\in G^\prime$ (by definition this means $e\not\in
G^\prime$).

Further, if M is fully bonded then we say e is pending on M if one of $g_i, g_j\in G^\prime$ but the other is not. If e is not used, not unused and not pending, then e is forbidden. By construction, all bonds can be placed into one of these four categories. The corresponding bond f in H can be categorized in the same fashion.

Furthermore, if $M_1\subset M_2$ and |M1|=|M2| and M2 is fully bonded, then M2 can be built from M1 by identifying which pending bonds are now used or forbidden and which unused bonds are now forbidden or pending.

Given some bond $e\in M_1$ we say that e is totally forbidden in M1 if there does not exist any $M_2\supset M_1$ where e is used by M2. It is straightforward to show that if e is forbidden by some mapping then it is also absolutely forbidden (proof by contradiction and knowing the mapping is fully bonded). Also, by definition of an MCS, if M is an MCS then all bonds in G can be labeled used or absolutely forbidden by M.

Consider some $e\in M$ where e is pending and M is fully bonded. If is it not possible to add e and its pending atom gj to M to create a new, valid mapping, then e is also absolutely forbidden. (Note that e and gj can be mapped to several possible f and hj.) More specifically, if no such e exists for some M, then M describes a maximal common substructure (though not necessarily the maximum common substructure).

It is easy to show that for any two totally bonded mappings $M_i\subset M_j$ there is at least one Mk such that $M_i\subset
M_k\subseteq M_j$ and |Mi|+1=|Mk|. Additionally, there is some e in G which was pending in Mi but is used in Mk.

Combining these last two paragraphs, then for any totally bonded M0there exists a progression of totally bonded mappings $M_0\subset
M_1\subset ...\subset M_n$ where |Mi|+1=|Mj| for i+1=j and Mnis maximal.

Recall that for each Mi (i>0) there is a $e_i\in G$ which was pending in Mi-1 but used in Mi. Because for each mapping there is a unique totally bonded mapping of the same size, we have a method by which to construct the mapping chain given just M0 and the ordered bonds ei: for a given $M_{i-1} (0<i\le n)$, create the (not totally bonded) mapping $M_i^\prime$ which includes the once pending bond ei and its pending atom. From $M_i^\prime$, Mi is the fully bonded mapping of the same size.

Therefore, all maximal common substructure can be generated from starting with each unique, smallest fully bonded M0 (called a seed) and trying all possible combinations of pending bonds with their corresponding bonds in H. At least one of these maximal substructures will be a maximum common substructure.

The generation code described will not produce unique maximal substructures, but a minor change to the algorithm remedies that. Consider any fully bonded mapping Mi which doesn't describe a maximal substructure. Let a and b be two possible bonds which can be used to expand Mi by another atom. I need to show these will always generate unique different substructures.

First generate all substructures expanding Mi by a. By the inductive assumption, these will all be unique. Then mark bond awith the additional property that it is tainted. (A more generic word for tainted as used in graph traversals is visited.) Now generate all the substructures expanding Mi by bexcept that if a, or more generally, any tainted bond, is included in the mapping then the search branch can be terminated. This can be done because all mappings using a from Mi have already been generated. The resulting substructures are now all unique.

As another possibility, instead of tainting an edge, make it be forbidden. This will still generate common substructures, but the results won't fully bonded, and some of the subgraphs may be subset of others. This is best seen with an example. Consider a triangular substructure with three atoms and three bonds. When two atoms are in a fully bonded subgraph, then two edges are pending. Pick one edge and find all of its maximal structures. Now mark it as forbidden, which effectively removes it from the graph. The other edge is still pending, so it can be traversed. This produces a new subgraph, though with 3 atoms and 2 bonds. But the subgraph is not fully bonded in the original graph and it is a subset of the triangle solutions already found.

Making the edges forbidden is useful for testing the tainted version. The maximum mapped size of either algorithm is the same, and forbidding edges is easier to implement than tainting because the forbid mechanisms needed to be present for the underlying algorithm to work. Since the forbid code generates more substructures than the taint version, the overall time will be slower on average. Our preliminary tests show it to be about 15% slower.

Validation and Comparison Results

Validation

This pairwise implementation was tested a couple of different ways. The original tests were to ensure that simple graphs are matched correctly, as CO to COC and the more complicated test case of c1ccc2c(c1)ccc3ccccc32 to c1ccc2cc3ccccc3cc2c1.

The more complete tests used a set of 9 SMILES strings describing similar compounds. These are:

The smallest molecule has 16 atoms while the largest has 37. The average size is 24 atoms.

The 45 possible combinations were generated using both the tainted and forbidden methods and the MCS sizes compared. This turned out to be a very good way to cross check. After, alas, entirely too many mistakes, I got them to give the same answers. The results of some of the comparisons were eyeball checked to see if they were correct, and they seem to be so.

By the way, the eyeball comparisons used the incomplete TkDepict widget, which is part of PyDaylight. It contains enough implementation to be able to label matching atoms by number and matching edges by letters.

A couple of different comparison functions were used and tested against each other (calling the Daylight toolkit directly or using a precomputed matrix). This was used to test the Compare interface orthogonality to different implementation, but it also found an implementation bug.

The cutoff code was tested by writing using several classes which implemented the Maximal protocol. This was used to check that if all maximal structures were generated, then the best result would still be no bigger than with an earlier cutoff. It was also used to double check that the taint implementation did indeed only find unique mappings (and showed that the forbidden one finds some non-fully bonded solutions).

Comparison Results

The N-by-N comparisons generated information about the comparison times and the number of maximal substructures in the test set. The times were generated on val, a two-processor 250 MHz R10000 Octane. One of the processors was fully loaded, so the real times should be slightly smaller. Also, recall that the implementation is in unoptimized Python so expect about a factor of 5 speedup with a rewrite in C.

If every molecule is compared to every other molecule (including itself) then the average MCS time was 11.7 seconds for the matrix comparison and 12.5 for the direct toolkit comparison. If a molecule wasn't compared to itself then the average times moved to 14.4 and 15.7 seconds, since the size cutoff is quickly found in a self-self MCS. Hmmm, interesting. Using the optimize flag, -O, for Python which gets rid of the assert statements scattered throughout the code (which were extremely useful to track down bugs!) and the line number op code in the Python byte code. The times went down to 10.1 and 11.3 seconds.

A scatter plot of time vs. size, where size is one of {the size of the smaller compound, the larger one, or their product}, doesn't show anything significant. Probably need to try more compounds, and add a factor describing ``local symmetries'' like branches to identical atoms, which can conflate the times.

The worst time was 87 seconds for the comparison of these two:

The problem is likely with the number of rings in the second molecule along with the the high number of similar atoms (carbons) and bonds (aromatic). They create a large number of alternate mappings so with a search depth of, say, 10, there are 410 possible mappings.

I tried to generate a list of all the maximal substructures (including self-self comparisons). This was not fully successful because of these time needed to compare these two aforementioned structures. It took 55.5 hours to finish (I let it run to completion to ensure that it wasn't stuck in a loop somewhere). Of the results so far, there are about 300 maximal structures on average, with the most being about 700 structures. The average comparison time is 95 seconds.

About this document ...

This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)

Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -split 2 mcs.

The translation was initiated by on 1999-06-10


Subsections

1999-06-10