A while back I wrote about some updates I had made to the CDK fingerprinting code to improve performance. Recently Egon and Jonathan Alvarsson (Uppsala) had made even more improvements. Some of them are simple fixes (making a String final, using Set rather than List) while others are more significant (efficient caching of paths). In combination, they have improved performance by over 50%, compared to my last update. Egon has put up a nice summary of performance runs here. Excellent work guys!
Posts Tagged ‘benchmark’
A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned version (~ 3M rows in 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.
I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/ .
“Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”
Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.
Sometime back I described how I was porting the VFLib algorithms to Java, so that we could use it for substructure search, since the current UniversalIsomorphismTester is pretty slow for this task, in general. While I had translated the Ullman algorithm implementation of VFLib and shown that it outperformed the CDK method, it turned out that didn’t work for certain cases such as finding CCC in C1CC1. This was due to a different definition of isomorphism that VFLib used. Instead, I tried to convert the VF2 implementation to Java. The motivation was that it does indeed perform substructure matching as is usually understood in cheminformatics, and was also reported to be extremely fast. Unfortunately, my translation was buggy and I put it on hold.
In a previous post, I dicussed virtual screening benchmarks and some new public datasets for this purpose. I recently improved the performance of the CDK hashed fingerprints and the next question that arose is whether the CDK fingerprints are any good. With these new datasets, I decided to quantitatively measure how the CDK fingerprints compare to some other well known fingerprints.
Update – there was a small bug in the calculations used to generate the enrichment curves in this post. The bug is now fixed. The conclusions don’t change in a significant way. To get the latest (and more) results you should take a look here.
Since I do a lot of cheminformatics work in R, I’ve created various functions and packages that make life easier for me as do my modeling and analysis. Most of them are for private consumption. However, I’ve released a few of them to CRAN since they seem to be generally useful.
One of them is the fingerprint package (version 2.9 was just uploaded to CRAN) , that is designed to read and manipulate fingerprint data generated from various cheminformatics toolkits or packages. Right now it supports output from the CDK, BCI and MOE. Fingerprints are represented using S4 classes. This allows me to override the R logical operators, so that one can do things like compute the logical OR of two fingerprints.
Virtual screening (VS) is a common task in the drug discovery process and is a computational method to identify promising compounds from a collection of hundreds to millions of possible compounds. What “promising” exactly means, depends on the context – it might be compounds that will likely exhibit certain pharmacological effects. Or compounds that are expected to non-toxic. Or combinations of these and other properties. Many methods are available for virtual screening including similarity, docking and predictive models.
So, given the plethora of methods which one do we use? There are many factors affecting choice of VS method including availability, price, computational cost and so on. But in the end, deciding which one is better than another depends on the use of benchmarks. There are two features of VS benchmarks: the metric employed to decide whether one method is better than another and the data used for benchmarking. This post focuses on the latter aspect.