I met with Jean-Claude Bradley yesterday and we had a pretty useful hack session, allowing him to easily incorporate chemical and cheminformatics functionality into a GoogleDocs spreadsheet.
A common task that Jean-Claude wanted to automate was the calculation of milligrams (or milliliters) of a chemical required for a certain molarity. So what we need for this calculation is the compound name, desired molarity, molecular weight and the density. Importantly, the people who’d like to use this will provide compound names and not a directly parseable SMILES. So we’d also like to (optionally) get the SMILES. Finally, he wanted to be able to do this in a Google spreadsheet – rather than a specific web page or stand alone program.
We start of with a Google spreadsheet containing a column of chemical names such as “benaldehyde”, “aspirin” and so on. The first thing we need is a value of density. Since ChemSpider reports experimental and predicted values of density the quickest way was to extract it from there. Unfortunately I couldn’t work out how to use the Web API, so I fell back on the venerable tradition of screen scraping. Thus, one can get the page for benzaldehyde via
and then using a regular expression, extract the value of density. In some cases, this won’t be available or multiple values may be listed. Right now we just look at the first available value. To make life simple, I wrapped this is in a simple mod_python program resulting in a simple REST interface to density values:
This returns a page containing a single number for the density and if it couldn’t find it, returns -1. We also get a nice freebie – replacing with benzaldehyde with c1ccccc1C=O also returns the density value. One could also provide the InChI key and get the same result.
OK, so we have an easy way to get the density of a compound given its name in an automated fashion. How to get this into the spreadsheet? If the compound name is in cell A2, simply set the value of cell B2 (for example) to
and on hitting enter the cell value should be 1.045. As you’d expect, dragging B2 down should replicate the function appropriately over the rows.
The next thing we want is to get the molecular weight. While we can also extract this from the ChemSpider web pages, it was a little more tricky. I first tried BeautifulSoup but it seems that ChemSpider has some weird Unicode stuff going on and this approach didn’t work (at least in the 10 minutes I looked at it!). So I fell back to our trusty mirror PubChem. This database has a collection of synonyms associated with each compound. Given a chemical name, we can look up the synyonym and identify a compound id (CID) and simply pull out the reported molecular weight in a single SQL query. Once again, this gets wrapped in a simple REST interface
giving back a single number if the molecule was found in the database. As with the density, we can set the value of a cell in the spreadsheet using ImportData. Of course it’s useful to get the SMILES for a chemical name. While there are tools that one could use such as LexiChem, one could use ChemSpider. However I already had a REST service that uses our local PubChem mirror allowing one to write
giving back the SMILES. Once again, this is easily incorporated into a Google spreadsheet. So at this point we have columns of chemical names (manually entered), desired molarity (manually entered), molecular weight (via REST) and density (via REST). The final calculation is trivial and voila, one has the required mililiters of the chemical required.
This was a quick hack so there are a number of issues. It’s possible that the ChemSpider query will return multiple hits. Right now, such cases are treated as getting no hits – since we don’t know which one we want. We’re using the PubChem mirror for name to structure / weight lookup. There are two issues – first, it’s usually a month out of sync with the real PubChem. And second, multiple hits for a given name are possible and if the input name is slightly misspelled, it may not be found in the DB even though it exists. Another problem is that while we can get the density via common name, SMILES or InChi, the PubChem mirror lookup only works with common name. But extending this to support other forms of input is trivial.
Ideally, it’d be all done through one source (say ChemSpider), but the current approach does highlight the distributed nature of the solution. In the end the individual pieces are quite simple in terms of code and content. But by putting it all together, we end up with a very useful application that does real stuff!
But the bigger message that comes out of this is that while Google spreadsheets is a general purpose tool, it’s online nature, coupled with the ability to import data from the web makes it customisable for different domains. In this case, it turns out we can do cheminformatics in a Google spreadsheet.