A few days back, Hari on FriendFeed had asked how one could get a a CAS number from a PubChem compound ID (CID). The reverse, that is finding a CID for a given CAS number is generally quite easy as shown by Rich here and here. Since I was trying to get some writing done, this was a good excuse for a quick hack to solve the problem.
Posts Tagged ‘postgres’
A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned version (~ 3M rows in 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.
I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/ .
“Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”
Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.
Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching and to achieve that we’ve been employing the Ballester and Graham-Richards method. In this post I’m going to talk about performance – how we went from a single monolithic database with long query times, to multiple databases and significantly faster multi-threaded queries.
Pub3D is a 3D version of PubChem, in which we have generated a single conformer for 99% of PubChem using the smi23d suite of programs. The structures are then stored in a PostgreSQL database along with their distance moment shape descriptors described by Ballester and Graham-Richards. This allows us to perform shape similarity queries against a user supplied 3D structure. By partitioning the database (thanks to the CGL folks at IU) and using a spatial index, performance is quite snappy. (I had briefly mentioned this in a presentation at the ACS meeting, last spring).
The database had been down for some time, so today I got it back up and running and AJAX’ified the interface, to make it look a little nicer. jQuery rocks! (OK, the color scheme sucks)
There are obvious drawbacks to the current database – single conformer shape search is not very rigorous, especially since the stored structures are not necessarily the minimum energy conformer. However, we have started generating multiple conformers, so hopefully we’ll address this issue in time. The bigger issue is how this approach to shape similarity compares to other well known approaches such as ROCS. Clearly, a shape descriptor approach is lower resolution to a volumetric approach such as ROCS, so in that sense the results are ‘rougher’. However visual inspection of some searches seems to indicate that it isn’t too bad. The paper describing these shape descriptors didn’t do a rigorous comparison – that’s on our TODO list.
OK, the fun part (a.k.a, coding) is done for now – got to get back to the paper.