The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However it’s not very efficient and this shows when performing substructure searches over large collections. A quick test where I compared the CDK code to OpenBabel’s obgrep showed that the CDK is nearly forty times slower than OpenBabel. Improvements in this code will enhance SMARTS matching, pharmacophore searching, fingerprinting and descriptors.
The Ullman algorithm is a well known method to perform subgraph isomorphism and even though more than thirty years old, is still used in many applications. I implemented this algorithm, based on the C++ implementation in VFLib, to see whether it’d do better than the method currently used in the CDK.
The results of some initial performance tests are listed below. I timed (using System.currentTimeMillis) a 1000 runs of the substructure search using each of the methods. This test only considered substructure searching using connectivity, ignoring atom and bond characteristics. A
|Target||Query||Ullman Time (ms)||UIT Time (ms)|
While the results look very impressive, the UIT code is finding all subgraph mappings, whereas the Ullman implementation is returning after finding the first one. For certain purposes this is useful (say for something like obgrep), but is obviously not a general solution. It’s also interesting to note that I’m not using the refinement step described by Ullman and implemented in VFLib,. With that I might be able to shave off a few more milliseconds.
Right now the code is not integrated into the CDK – before that it’ll need more work to get all mappings found and extensive testing to check for correctness (at least it differentiates between cyclopropane and isobutane!).