fingerprint | So much to do, so little time

Posts Tagged ‘fingerprint’

The Speedups Keep on Coming

Posted in cheminformatics, software, tagged benchmark, cdk, fingerprint, performance on December 4, 2008| 7 Comments »

A while back I wrote about some updates I had made to the CDK fingerprinting code to improve performance. Recently Egon and Jonathan Alvarsson (Uppsala) have made even more improvements. Some of them are simple fixes (making a String[] final, using Set rather than List) while others are more significant (efficient caching of paths). In combination, they have improved performance by over 50% compared to my last update. Egon has put up a nice summary of performance runs here. Excellent work guys!

Do the CDK Fingerprints Work?

Posted in cheminformatics, software, tagged benchmark, cdk, enrichment, fingerprint, pubchem, similarity on October 11, 2008| 4 Comments »

In a previous post, I discussed virtual screening benchmarks and some new public datasets for this purpose. I recently improved the performance of the CDK hashed fingerprints, and the next question that arose was whether the CDK fingerprints are any good. With these new datasets, I decided to quantitatively measure how the CDK fingerprints compare to some other well-known fingerprints.

Update – there was a small bug in the calculations used to generate the enrichment curves in this post. The bug is now fixed. The conclusions don’t change in a significant way. To get the latest (and more) results you should take a look here.

Working With Fingerprints in R (can’t beat C!)

Posted in cheminformatics, software, tagged benchmark, c++, CRAN, fingerprint, R, similarity on October 11, 2008| Leave a Comment »

Since I do a lot of cheminformatics work in R, I’ve created various functions and packages that make life easier for me as I do my modeling and analysis. Most of them are for private consumption. However, I’ve released a few of them to CRAN since they seem to be generally useful.

One of them is the fingerprint package (version 2.9 was just uploaded to CRAN), which is designed to read and manipulate fingerprint data generated by various cheminformatics toolkits or packages. Right now it supports output from the CDK, BCI and MOE. Fingerprints are represented using S4 classes. This allows me to override the R logical operators, so that one can do things like compute the logical OR of two fingerprints.
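The package itself is written in R, but the idea of overriding logical operators on a fingerprint class can be sketched in any language with operator overloading. The Python below is only an analogy under stated assumptions: the `Fingerprint` class, its `nbit` and `bits` fields, and the `tanimoto` helper are hypothetical names made up for this illustration and are not the package's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical binary fingerprint: a length plus the set of 'on' positions."""
    nbit: int
    bits: frozenset  # positions of the bits that are set to 1

    def _check(self, other):
        # Logical operations only make sense for fingerprints of equal length.
        if self.nbit != other.nbit:
            raise ValueError("fingerprints must have the same length")

    def __or__(self, other):
        # Logical OR: a bit is on if it is on in either fingerprint.
        self._check(other)
        return Fingerprint(self.nbit, self.bits | other.bits)

    def __and__(self, other):
        # Logical AND: a bit is on only if it is on in both fingerprints.
        self._check(other)
        return Fingerprint(self.nbit, self.bits & other.bits)


def tanimoto(a, b):
    """Tanimoto similarity built from the overloaded operators (illustrative)."""
    common = len((a & b).bits)
    union = len((a | b).bits)
    return common / union if union else 0.0


fp1 = Fingerprint(8, frozenset({1, 3, 5}))
fp2 = Fingerprint(8, frozenset({3, 5, 7}))
print(sorted((fp1 | fp2).bits))  # [1, 3, 5, 7]
print(tanimoto(fp1, fp2))        # 0.5
```

Storing only the positions of the set bits keeps the logical operations down to plain set unions and intersections.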

Which Bits are Important for Similarity Searches?

Posted in cheminformatics, research, tagged fingerprint, maccs, similarity, tanimoto on October 6, 2008| Leave a Comment »

The recent paper by Wang and Bajorath is an interesting approach to identifying the important bits in a fingerprint, with respect to a dataset.

Their discussion focuses on the structural key type fingerprints (such as MACCS and the BCI fingerprints), and the problem they are trying to address is that certain structural features may be more important for similarity searching than others. This is also related to the fact that molecular complexity (i.e., the number of structural features) can lead to bias in similarity calculations [1]. Given a dataset, an easy way to identify the important bits is the so-called consensus approach [2, 3]: basically, find out which bit positions are set to 1 for all (or a specified fraction) of the dataset. While useful, this approach can be misleading if the target dataset has many molecules with a large number of structural features (so that many bits in the fingerprint will be set to 1).
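The consensus idea is simple enough to sketch in a few lines. The Python below is a minimal illustration, assuming each fingerprint is given as a set of "on" bit positions; the function name `consensus_bits`, its parameters, and the toy dataset are made up for this example and are not taken from the paper.

```python
def consensus_bits(fingerprints, nbit, fraction=1.0):
    """Return the bit positions set to 1 in at least `fraction` of the dataset.

    `fingerprints` is a list of sets of 'on' bit positions (an assumed format).
    """
    if not fingerprints:
        return []
    # Count how many fingerprints have each bit position set.
    counts = [0] * nbit
    for fp in fingerprints:
        for pos in fp:
            counts[pos] += 1
    threshold = fraction * len(fingerprints)
    return [pos for pos, count in enumerate(counts) if count >= threshold]


# Toy dataset of four 8-bit fingerprints (illustrative values only).
dataset = [{0, 2, 5}, {0, 2, 6}, {0, 3, 5}, {0, 2, 5, 7}]
print(consensus_bits(dataset, nbit=8))                 # [0]
print(consensus_bits(dataset, nbit=8, fraction=0.75))  # [0, 2, 5]
```

The caveat above is easy to see here: if most molecules in the dataset set most bits, nearly every position passes the threshold and the consensus stops being informative.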

Faster Fingerprinting

Posted in software, tagged cdk, dfs, fingerprint, hash, optimize, path, performance on September 12, 2008| 3 Comments »

In my last post I had reported some timing measurements for various operations. One of them was fingerprinting using the path-based hashing Fingerprinter class in the CDK. As reported, it took nearly 4 minutes to process a 1000-molecule subset of ZINC. Not good.

So I spent a little time last night hacking on the code, primarily making the search for unique paths a little faster. Happily, my latest commit (in 1.2.x, should be merged into trunk soon) allows the fingerprinter to process 1000 molecules in approximately 59s – a 4X speed up.

In terms of behavior, the new code gets the exact same paths as the old code, the only difference being that the order of atoms in a path can be reversed. Since the fingerprint is generated by hashing “path strings”, this means that the fingerprints from the new code will differ slightly from those from the old code. So if you’re working with a bunch of fingerprints calculated with the old code, you should probably regenerate them with the new code.
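To see why reversed atom order matters, here is a small Python sketch of hashing path strings into bit positions. It is an illustration under stated assumptions, not the CDK implementation: CRC32 stands in for whatever hash the CDK actually uses, and the names `path_to_bit` and `fingerprint`, the `-` separator, and the example paths are all hypothetical.

```python
import zlib


def path_to_bit(path, nbit=1024):
    """Map a path (a sequence of atom symbols) to a bit position.

    CRC32 is a stand-in hash chosen for reproducibility; it is an assumption,
    not the hash function used by the CDK.
    """
    path_string = "-".join(path)
    return zlib.crc32(path_string.encode("utf-8")) % nbit


def fingerprint(paths, nbit=1024):
    """Return the set of bit positions switched on by a collection of paths."""
    return {path_to_bit(p, nbit) for p in paths}


old_paths = [("C", "C", "O"), ("C", "N")]
# Same paths, but with the atoms listed in the opposite order.
new_paths = [tuple(reversed(p)) for p in old_paths]

print(sorted(fingerprint(old_paths)))
print(sorted(fingerprint(new_paths)))  # generally different bit positions

# A palindromic path produces the same string either way, so its bit is unchanged.
symmetric = ("C", "O", "C")
print(path_to_bit(symmetric) == path_to_bit(tuple(reversed(symmetric))))  # True
```

Only paths whose string changes on reversal end up in different bits, which is consistent with the fingerprints differing slightly rather than completely.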
