With all the stuff I’ve been hearing about Git I’ve been looking to play around with it. While I have been hosting my own Subversion repo on my office machine, the use of GitHub seemed like a good way to play with Git and also have a stable external repo.

So right now the CDKDescUI project has been shifted into Git and is located here. I’ve also shifted my REST web services here.

Joerg has made a nice blog post on the use of Open Source software and data to analyse the occurrence of antithrombotics. More specifically, he was trying to answer the question

Which XRay ligands are closest to the Fontaine et al. structure-activity relationship data for allowing structure-based drug design?

using Blue Obelisk tools and ChemSpider, where Fontaine et al. refers to the Fontaine Factor Xa dataset. You should read his post for a nice analysis of the problem. I just wanted to consider two points he raised.

I recently described a REST based service for performing PCA-based visualization of chemical spaces. By visiting a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
c1ccccc1,c1ccccc1CC,c1ccccc1CCC,C(=O)C(=O),CC(=O)O

one would get an HTML, plain text or JSON page containing the first two principal components for the molecules specified. With this data one can generate a simple 2D plot of the distribution of molecules in the “default” chemical space.

However, as Andrew Lang pointed out on FriendFeed, one could use SecondLife to look at 3D versions of the PCA results. So I updated the service to allow one to specify the number of components in the URL. The above form of the service will still work – you get the first two components by default.

To specify more components use a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/3/mol1,mol2,mol3

where mol1, mol2, mol3 etc. should be valid SMILES strings. The above URL will return the first three PCs. To get just the first PC, replace the 3 with 1 and so on. If more components are requested than are available, all components are returned.
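As a rough sketch, the two URL forms can be assembled with a small Python helper. The function name chemspace_url and its parameters are hypothetical and not part of the service. The molecule tokens are passed through unchanged, so the caller decides how they are written or encoded.

```python
# Base address taken from the URLs shown in this post.
BASE_URL = "http://rguha.ath.cx/~rguha/cicc/rest/chemspace"


def chemspace_url(molecules, space="default", n_components=None):
    """Build a chemspace URL (hypothetical helper, not part of the service).

    molecules    -- list of molecule tokens, joined with commas as-is
    space        -- name of the chemical space
    n_components -- number of principal components to request; when None
                    the segment is omitted and the service default applies
    """
    parts = [BASE_URL, space]
    if n_components is not None:
        parts.append(str(n_components))
    parts.append(",".join(molecules))
    return "/".join(parts)


# Default form (first two components) and explicit three-component form.
print(chemspace_url(["mol1", "mol2", "mol3"]))
print(chemspace_url(["mol1", "mol2", "mol3"], n_components=3))
```

Leaving n_components as None reproduces the original two-component URL, which matches the backwards-compatible behaviour described above.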

Currently, the only available space is the “default” space which is 4-dimensional, so you can get a maximum of four components. In general, visit the URL

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/

to obtain a list of currently available chemical spaces, their names and dimensionality.

Caveat

While it’s easy to get all the components and visualize them, it doesn’t always make sense to do so. In general, one should consider those initial principal components that explain a significant portion of the variance (see Kaiser’s criterion). The service currently doesn’t provide the eigenvalues, so it’s not really possible to decide whether to go to 3, 4 or more components. For most cases, just looking at the first two principal components will be sufficient – especially given the currently available chemical space.
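To make the idea concrete, here is a minimal sketch of Kaiser’s criterion, assuming one had the eigenvalues of a PCA on standardized (unit variance) descriptors. The service does not return them, so the eigenvalues below are made-up numbers for a 4-dimensional space, and both function names are hypothetical.

```python
def kaiser_components(eigenvalues):
    """Count the components whose eigenvalue exceeds 1 (Kaiser's criterion)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def explained_variance(eigenvalues):
    """Return the fraction of total variance explained by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


# Made-up eigenvalues for illustration only; NOT values from the service.
eigenvalues = [2.1, 1.2, 0.5, 0.2]

n_keep = kaiser_components(eigenvalues)
fractions = explained_variance(eigenvalues)
print("components to keep:", n_keep)
print("variance explained by kept components:", sum(fractions[:n_keep]))
```

With eigenvalues like these, only the first two components pass the threshold, which is the kind of situation in which a 2D plot is enough.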

Update (Jan 13, 2009)

Since the descriptor service now requires Base64 encoded SMILES, the example usage URL is now invalid. Instead, the SMILES should be replaced by their encoded versions. In other words, the first URL above becomes

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
YzFjY2NjYzE=,YzFjY2NjYzFDQw==,YzFjY2NjYzFDQ0M=,
Qyg9TylDKD1PKQ==,Q0MoPU8pTw==
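The encoded strings above can be reproduced with Python’s standard base64 module. This is a small sketch: the helper name encode_smiles is hypothetical, and it assumes the service expects the standard Base64 alphabet, which is what the example above uses.

```python
import base64


def encode_smiles(smiles):
    """Base64 encode a SMILES string (standard alphabet, assumed here)."""
    return base64.b64encode(smiles.encode("ascii")).decode("ascii")


# The five molecules from the first example URL in this post.
molecules = ["c1ccccc1", "c1ccccc1CC", "c1ccccc1CCC", "C(=O)C(=O)", "CC(=O)O"]

encoded = ",".join(encode_smiles(s) for s in molecules)
url = "http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/" + encoded
print(url)
```

Note that the standard alphabet can emit "/" and "+" for some inputs. None of the five molecules here produce them, and how the service treats such characters is not covered in this post.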

The ONSChallenge has been running for some time now and the simple web query form that tied in the data from Google Docs along with web services from IU has turned out to be pretty handy. With more and more data becoming available, I had done some initial exploratory analysis of the measured solubilities. One thing that is useful to the experimentalists is a suggestion of which compound to test next. This could be made on the basis of many factors – availability, ease of synthesis and so on. But one way to look at it is to examine what types of compounds have been tested previously, and suggest that the subsequent compounds be very different from those that have been tested.
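One simple way to express “very different from what has been tested” is to pick the candidate whose nearest tested neighbour is least similar. The sketch below assumes each compound is represented by a fingerprint stored as a Python set of on-bit positions and uses the Tanimoto coefficient. The fingerprints and names are invented for illustration, and this is not necessarily the approach taken in the full post.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_dissimilar(candidates, tested):
    """Return the candidate name whose closest tested compound is least similar.

    candidates -- dict mapping a name to a fingerprint (set of ints)
    tested     -- list of fingerprints of compounds already measured
    """
    def nearest_similarity(fp):
        return max(tanimoto(fp, t) for t in tested)

    return min(candidates, key=lambda name: nearest_similarity(candidates[name]))


# Invented fingerprints, for illustration only.
tested = [{1, 2, 3}, {2, 3, 4}]
candidates = {"A": {1, 2, 3, 4}, "B": {7, 8, 9}, "C": {3, 4, 5}}
print(most_dissimilar(candidates, tested))
```

Using the maximum similarity to any tested compound, rather than the average, avoids suggesting a compound that is a near duplicate of just one of the earlier measurements.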

Being fond of cooking, I’ve tended to collect recipes, utensils and gadgets. One thing that had been missing was a cast iron skillet. I’d been hearing about the wonders of these (naturally non-stick over time, holds heat, evenly distributes heat) for a long time and have been disillusioned with the non-stick stuff (though a small non-stick pan for eggs is handy). So we finally decided to pick up a Lodge cast iron skillet. Though it’s sold as pre-seasoned, we seasoned it once before use.

Our first attempt at using it was to make pan seared steak for Christmas lunch, using directions (1, 2) from Alton Brown. We took a juicy 12 oz ribeye, seasoned with kosher salt and coarse ground pepper. Searing it for 90 s on the oven top and then putting it into a 500F oven for 3 minutes each side resulted in a beautiful medium steak. While the steak was resting, we put together a simple sauce with red wine, shallots and the brown bits from the pan.

The result was heavenly! Looks like cooking will be fun with the new skillet.

Over the last few years there has been a lot of activity in the area of Open Source cheminformatics software. Being a contributor to the CDK as well as a supporter of Open Source and Open Data efforts in general, I was delighted to be given the chance to talk about these topics at the BioIT World Conference & Expo. I’ll be talking about the state of the art in Open Source cheminformatics, highlighting the advantages and pitfalls of using this type of software, using examples from toolkits, workbenches, pipelining tools and so on. In addition, I’ll be talking a little bit about Open Data and its importance and the possibilities that arise from combining Open Source software and Open Data.

Here’s the announcement of the actual meeting:

Join the life sciences community in Boston, MA next April 27-29, 2009 for the 7th Annual Bio-IT World Conference & Expo (www.bio-itworldexpo.com).  Since its debut in 2002, Bio-IT has established itself as a premier event showcasing the myriad applications of IT and informatics to biomedical research and the drug discovery enterprise.  The 2009 program will feature best practice case studies and joint partner presentations relevant to the technologies, research, and regulatory issues of life science, pharmaceutical, clinical, health, and IT professionals.

News of the ChemSpider Journal of Chemistry has been posted in various places. This effort is interesting as it is a combination of features that are currently available in different forms. Like other Open Access journals, the CJC will follow the BOAI and hence be Open Access. In addition it will exhibit markup of the text, such as done by the RSC journals (which are not OA). I’m especially interested in this latter feature for automated processing of articles. While it is good to see the combination of these features, it is also interesting to see that the journal will use a just-in-time (JIT) approach, and allow online peer review and commentaries. In this sense, it can be expected to be an especially good venue for ONS style projects.

I think this effort will be an interesting experiment, especially given that many “traditional” chemists may not have blogs and wikis to support a JIT approach, and that a journal might be more acceptable. I recently joined the editorial board. I’m eager to see how the journal evolves and am pleased to be able to contribute to this effort, and I encourage others to do so as well.

A few days back, Hari on FriendFeed asked how one could get a CAS number from a PubChem compound ID (CID). The reverse, that is, finding a CID for a given CAS number, is generally quite easy as shown by Rich here and here. Since I was trying to get some writing done, this was a good excuse for a quick hack to solve the problem.

I met with Jean-Claude Bradley yesterday and we had a pretty useful hack session, allowing him to easily incorporate chemical and cheminformatics functionality into a GoogleDocs spreadsheet.

A common task that Jean-Claude wanted to automate was the calculation of milligrams (or milliliters) of a chemical required for a certain molarity.  So what we need for this calculation is the compound name, desired molarity, molecular weight and the density. Importantly, the people who’d like to use this will provide compound names and not a directly parseable SMILES.  So we’d also like to (optionally) get the SMILES. Finally, he wanted to be able to do this in a Google spreadsheet – rather than a specific web page or stand alone program.

It turns out that with a liberal helping of Python, a dash of ChemSpider and a pinch of PubChem, all of this can be done in a half hour hack session.
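The arithmetic at the heart of this is short. The sketch below is not the code from the hack session: the function names are hypothetical, and it assumes the user also supplies the volume of solution to prepare, which the description above does not spell out. The molecular weight and density are approximate values for benzaldehyde, used purely as an example.

```python
def mass_mg(molarity, solution_ml, mol_weight):
    """Milligrams of compound needed for a solution of the given molarity.

    molarity    -- target concentration in mol/L
    solution_ml -- volume of solution to prepare in mL (assumed input)
    mol_weight  -- molecular weight in g/mol
    (mol/L * mL gives mmol; mmol * g/mol gives mg.)
    """
    return molarity * solution_ml * mol_weight


def liquid_volume_ml(mass_in_mg, density):
    """Millilitres of a liquid compound for a given mass (density in g/mL)."""
    return (mass_in_mg / 1000.0) / density


# Illustrative values: approximate molecular weight and density of benzaldehyde.
needed_mg = mass_mg(molarity=0.5, solution_ml=10.0, mol_weight=106.12)
needed_ml = liquid_volume_ml(needed_mg, density=1.044)
print(round(needed_mg, 1), "mg or", round(needed_ml, 3), "mL")
```

In the spreadsheet setting, the molecular weight and density would come from a name lookup rather than being typed in by hand.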

A while back I wrote about some updates I had made to the CDK fingerprinting code to improve performance. Recently Egon and Jonathan Alvarsson (Uppsala) have made even more improvements. Some of them are simple fixes (making a String[] final, using Set rather than List) while others are more significant (efficient caching of paths). In combination, they have improved performance by over 50% compared to my last update. Egon has put up a nice summary of performance runs here. Excellent work guys!