In my last post I had reported some timing measurements for various operations. One of them was fingerprinting using the path-based hashing Fingerprinter class in the CDK. As reported, it took nearly 4 minutes to process a 1000-molecule subset of ZINC. Not good.
So I spent a little time last night hacking on the code, primarily making the search for unique paths a little faster. Happily, my latest commit (in 1.2.x, should be merged into trunk soon) allows the fingerprinter to process 1000 molecules in approximately 59s – a 4X speed up.
In terms of behavior, the new code gets the exact same paths as the old code, the only difference being that the order of atoms in the path can be reversed. Since the fingerprint is generated by hashing “path strings”, this means that the fingerprints from the new code will differ slightly from the old code. So if you’re working witha bunch of fingerprints calculated with the old code, you should probably regenarate them with the new code.