Some time back I was playing around with dynamic HTML and came across a tutorial that described how to implement the dynamic suggestion feature that is commonly found on many websites (such as Google and Amazon). This set me wondering how I could use this mechanism to dynamically depict a SMILES string as I type it.
Archive for October, 2008
I came across an interesting site called the World Names Profiler, which, given a surname, colors a map of the world based on the frequency of occurrence of the name in different countries. They have a dataset of 300 million names across 26 countries.
While it’s a nice visualization, it was very interesting for me to see the spread of Indian surnames, as the Indian diaspora is spread out all over the globe. Obviously Indian surnames have a maximum frequency in India, but it’s quite interesting to note that Guha has a high frequency in North America and Central Europe and a very low frequency in Australia. I was also surprised to see that it had a non-zero occurrence in Argentina. On the other hand, Ghosh has a higher frequency in Canada than in the US, and a higher frequency in Argentina than Guha. However, Patel has a much higher frequency in Australia than either Guha or Ghosh. Singh, on the other hand, appears to have similar frequencies in Canada and Australia, which are both higher than in the US.
I chose these surnames because they’re pretty common Indian surnames. One could correlate frequencies of occurrence to the background represented by the surnames, but that would be easily confounded by stereotypes. However, for me, it’s a nice visualization of how Indians have spread over the globe.
In a previous post, I discussed virtual screening benchmarks and some new public datasets for this purpose. I recently improved the performance of the CDK hashed fingerprints, and the next question that arose was whether the CDK fingerprints are any good. With these new datasets, I decided to quantitatively measure how the CDK fingerprints compare to some other well known fingerprints.
Update – there was a small bug in the calculations used to generate the enrichment curves in this post. The bug is now fixed. The conclusions don’t change in a significant way. To get the latest (and more) results you should take a look here.
Since I do a lot of cheminformatics work in R, I’ve created various functions and packages that make life easier for me as I do my modeling and analysis. Most of them are for private consumption. However, I’ve released a few of them to CRAN since they seem to be generally useful.
One of them is the fingerprint package (version 2.9 was just uploaded to CRAN), which is designed to read and manipulate fingerprint data generated by various cheminformatics toolkits or packages. Right now it supports output from the CDK, BCI and MOE. Fingerprints are represented using S4 classes. This allows me to override the R logical operators, so that one can do things like compute the logical OR of two fingerprints.
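To illustrate the operator-overloading idea outside of R, here is a minimal Python sketch. It is not the fingerprint package's API: the `Fingerprint` class, its fields and the `tanimoto` helper are hypothetical names made up for this example, and a fingerprint is assumed to be stored simply as the set of its "on" bit positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical binary fingerprint: the set of 'on' bit positions."""
    bits: frozenset
    nbit: int = 1024  # assumed fingerprint length

    def __or__(self, other):
        # Logical OR: a bit is on if it is on in either fingerprint
        return Fingerprint(self.bits | other.bits, self.nbit)

    def __and__(self, other):
        # Logical AND: a bit is on only if it is on in both fingerprints
        return Fingerprint(self.bits & other.bits, self.nbit)


def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = len((a | b).bits)
    return len((a & b).bits) / union if union else 0.0


fp1 = Fingerprint(frozenset({1, 5, 9}))
fp2 = Fingerprint(frozenset({5, 9, 42}))
combined = fp1 | fp2
print(sorted(combined.bits))  # [1, 5, 9, 42]
print(tanimoto(fp1, fp2))     # 0.5
```

Once the operators are defined on the class, similarity measures such as Tanimoto fall out as one-liners, which is the convenience the S4 approach gives in R.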
Virtual screening (VS) is a common task in the drug discovery process and is a computational method to identify promising compounds from a collection of hundreds to millions of possible compounds. What “promising” exactly means depends on the context – it might be compounds that will likely exhibit certain pharmacological effects. Or compounds that are expected to be non-toxic. Or combinations of these and other properties. Many methods are available for virtual screening, including similarity, docking and predictive models.
So, given the plethora of methods, which one do we use? There are many factors affecting the choice of VS method, including availability, price, computational cost and so on. But in the end, deciding which one is better than another depends on the use of benchmarks. There are two features of VS benchmarks: the metric employed to decide whether one method is better than another, and the data used for benchmarking. This post focuses on the latter aspect.
The recent paper by Wang and Bajorath is an interesting approach to identifying the important bits in a fingerprint, with respect to a dataset.
Their discussion focuses on the structural key type fingerprints (such as MACCS and the BCI fingerprints), and the problem they are trying to address is the fact that certain structural features may be more important for similarity searching than others. This is also related to the fact that molecular complexity (i.e., the number of structural features) can lead to bias in similarity calculations. Given a dataset, an easy way to identify the important bits is the so called consensus approach [2, 3]: basically, find out which bit positions are set to 1 for all (or a specified fraction) of the dataset. While useful, this can be misleading if the target dataset has many molecules with a large number of structural features (so that many bits in the fingerprint will be set to 1).
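The consensus idea is simple enough to sketch in a few lines of Python. This is only an illustration of the approach as described above, not code from the cited papers: the function name `consensus_bits`, the `fraction` parameter and the toy dataset are all assumptions, and each fingerprint is assumed to be given as the set of its on-bit positions.

```python
from collections import Counter


def consensus_bits(fingerprints, fraction=1.0):
    """Return bit positions set in at least `fraction` of the fingerprints.

    Each fingerprint is a set of 'on' bit positions (an assumed
    representation); fraction=1.0 keeps only bits set in every molecule.
    """
    if not fingerprints:
        return set()
    # Count in how many fingerprints each bit position is on
    counts = Counter(bit for fp in fingerprints for bit in fp)
    threshold = fraction * len(fingerprints)
    return {bit for bit, n in counts.items() if n >= threshold}


# Toy dataset of four made-up fingerprints
dataset = [{1, 4, 7}, {1, 4, 9}, {1, 7, 9}, {1, 4, 7, 9}]
print(consensus_bits(dataset))                # {1}
print(sorted(consensus_bits(dataset, 0.75)))  # [1, 4, 7, 9]
```

The second call also hints at the weakness mentioned above: once the threshold is relaxed, feature-rich molecules push many bits into the consensus set.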
Pub3D is a 3D version of PubChem, in which we have generated a single conformer for 99% of PubChem using the smi23d suite of programs. The structures are then stored in a PostgreSQL database along with their distance moment shape descriptors described by Ballester and Graham-Richards. This allows us to perform shape similarity queries against a user supplied 3D structure. By partitioning the database (thanks to the CGL folks at IU) and using a spatial index, performance is quite snappy. (I had briefly mentioned this in a presentation at the ACS meeting, last spring).
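For readers unfamiliar with distance moment descriptors, here is a rough Python sketch of the general idea as I understand it from the literature. It is not the Pub3D implementation: the function names and toy coordinates are made up, and the exact moment definitions (mean, variance and raw third central moment) are an assumption that may differ from what is stored in the database. Atom distances are measured to four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), and each distance distribution is summarized by three moments, giving 12 numbers per structure.

```python
import math


def _moments(values):
    """Mean, variance and third central moment of a list of distances."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, var, third]


def shape_descriptors(coords):
    """12 distance-moment descriptors for a list of (x, y, z) atom positions."""
    n = len(coords)
    ctd = tuple(sum(p[i] for p in coords) / n for i in range(3))  # centroid
    cst = min(coords, key=lambda p: math.dist(p, ctd))  # closest atom to centroid
    fct = max(coords, key=lambda p: math.dist(p, ctd))  # farthest atom from centroid
    ftf = max(coords, key=lambda p: math.dist(p, fct))  # farthest atom from fct
    desc = []
    for ref in (ctd, cst, fct, ftf):
        desc.extend(_moments([math.dist(p, ref) for p in coords]))
    return desc


def shape_similarity(d1, d2):
    """Score in (0, 1]: inverse of one plus the mean absolute difference."""
    diff = sum(abs(a - b) for a, b in zip(d1, d2)) / len(d1)
    return 1.0 / (1.0 + diff)


# Toy "molecule" and a translated copy; the shape is the same
mol = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
shifted = [(x + 5, y - 2, z + 1) for x, y, z in mol]
print(len(shape_descriptors(mol)))
print(shape_similarity(shape_descriptors(mol), shape_descriptors(shifted)))
```

Because the descriptors are plain fixed-length numeric vectors that do not depend on where the molecule sits in space, they can be stored in ordinary database columns and compared without any alignment step.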
The database had been down for some time, so today I got it back up and running and AJAX’ified the interface, to make it look a little nicer. jQuery rocks! (OK, the color scheme sucks)
There are obvious drawbacks to the current database – single conformer shape search is not very rigorous, especially since the stored structures are not necessarily the minimum energy conformer. However, we have started generating multiple conformers, so hopefully we’ll address this issue in time. The bigger issue is how this approach to shape similarity compares to other well known approaches such as ROCS. Clearly, a shape descriptor approach is lower resolution than a volumetric approach such as ROCS, so in that sense the results are ‘rougher’. However, visual inspection of some searches seems to indicate that it isn’t too bad. The paper describing these shape descriptors didn’t do a rigorous comparison – that’s on our TODO list.
OK, the fun part (a.k.a. coding) is done for now – got to get back to the paper.