Archive for September, 2008

The other day I was reading a paper and as is my habit, while reading I flip to see what papers are being cited. Since this was an ACS journal, the references are listed in the order that they occur in the text. When the authors were discussing a point in the paper, they’d usually include a number of references. Given the ordering of the references, this implies that related references are grouped together in the bibliography.

This set me thinking – given a set of references and their citations within a paper, we can capture relationships between the references in various ways. Most obviously, one might analyze the  cited papers (either in whole, or in part such as just the abstract or title) and draw conclusions.

However, the fact that the authors of the paper considered references X,  Y and Z to be related to a specific point already provides us with some information. Thus  in a bibliography where references are order based on first occurrence, can we use the “locality” of the references in the list to draw any conclusions? One could employ some form of a sliding window and look at groups of references. The key thing here would be to have a way to characterize a reference – so it’d probably require that you can access the title (or better, the abstract or full text) of the paper being cited. I will admit that I’m not sure what sort of conclusions one might draw from such an analysis – but it was interesting to observe “local behavior” in a list of references.

Not having followed work in bibliometrics, I’m sure someone has already thought of this and looked into it. If anybody has heard of stuff like this, I’d appreciate any pointers.

(Of course this is all moot, if we can’t easily access the paper itself)

Read Full Post »

Moving to SlideShare

Finally got round to putting a number of my slides onto SlideShare. While I was skeptical initially, I’ve found it quite handy to quickly browse through a presentation without having to download PDF’s or PPT’s and start up the viewers. Also this lets me not have to maintain a webpage listing all the presentations I’ve made, though it’ll probably still be there for the near future. However a new SlideShare widget lets me put up a single interface to all my presentations – it looks like a nice way to let users browse all the presentations (or a subset of them if I do some grouping)

Read Full Post »

The Chemical Descriptors Library (CDL) has been around for a while, but hasn’t seemed to get much publicity. A paper describing the design and performance of the library just came out today. While the name suggests a library of descriptors, it’s actually a general C++ library for cheminformatics. The library appears to use the molecular graph as its core concept and uses the Boost Graph Library (BGL) to represent and manipulate molecular graphs. Some features include substructure searching using SMARTS, fingerprints, descriptors (CATS, a bunch of topological’s etc) and file format reading (SMILES and SDF as far as I can see).

It seems nice and is available under the Boost Software License. While it does a lot of the basic operations, it doesn’t appear as comprehensive as say OpenBabel or RDKit. However, it’s good to see the cheminformatics toolkit ecosystem growing.

An aside – I haven’t really done much C++ coding and what little I do is basically ‘C in C++’. But how do people get their heads around C++ templates? I tend to get a headache when trying to examine one. And I thought that writing Java was tedious – C++ with templates takes the cake!

Update – Their Sourceforge project page is here but I can’t seem to find a download link.  A software paper with no software!

Read Full Post »

The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However it’s not very efficient and this shows when performing substructure searches over large collections. A quick test where I compared the CDK code to OpenBabel’s obgrep showed that the CDK is nearly forty times slower than OpenBabel. Improvements in this code will enhance SMARTS matching, pharmacophore searching, fingerprinting and descriptors.

The Ullman algorithm is a well known method to perform subgraph isomorphism and even though more than thirty years old, is still used in many applications. I implemented this algorithm, based on the C++ implementation in VFLib, to see whether it’d do better than the method currently used in the CDK.


Read Full Post »

This paper by Prof. Tim Pederson in the Journal of Computational Linguistics highlights the need for authors of computational linguistics papers to release working software that can be used to reproduce results in their papers.

While the paper focuses on the field of computational linguistics (CL), the discussion is perfectly applicable to other fields that publish computational research. Given my background in cheminformatics, which is heavily dependent on the use of computational tools, the points raised in the paper very applicable. For example, Prof. Pederson states

While we have table after table of results to pore over, we usually don’t have access to the software that would allow us to reproduce those results.

He also highlights four points on how to produce software to reproduce the results of research.  In this post, I wanted to highlight some aspects that have bugged me in the past and I think are important for transparency in computational research.


Read Full Post »

Faster Fingerprinting

In my last post I had reported some timing measurements for various operations. One of them was fingerprinting using the path-based hashing Fingerprinter class in the CDK. As reported, it took nearly 4 minutes to process a 1000-molecule subset of ZINC. Not good.

So I spent a little time last night hacking on the code, primarily making the search for unique paths a little faster. Happily, my latest commit (in 1.2.x, should be merged into trunk soon) allows the fingerprinter to process 1000 molecules in approximately 59s – a 4X speed up.

In terms of behavior, the new code gets the exact same paths as the old code, the only difference being that the order of atoms in the path can be reversed. Since the fingerprint is generated by hashing “path strings”, this means that the fingerprints from the new code will differ slightly from the old code. So if you’re working witha bunch of fingerprints calculated with the old code, you should probably regenarate them with the new code.

Read Full Post »

CDK Performance Measurements

As part of a larger project, I’ve been doing some profiling on various aspects of the CDK, focusing on core cheminformatics operations. I’m using the excellent YourKit profiler to do the tests. They tests are run on a Macbook Pro (2.16GHz) with 1GB RAM, using the latest trunk version of the CDK and JDK 1.5.

The test data is a 1000-molecule subset take from the ZINC collection. The operations I’ve been looking at are

The test harness  simply reads the 1000 molecules one by one and performs the operation in question. For certain tasks which are not atomic in nature, the code does a little more but the timing is measured only for the operation under study. In all cases, things like loading molecules from disk are not measured. The whole process is repeated 10 times and the times reported are the average of the 10 runs. A brief overview of the results:


Read Full Post »

Older Posts »