Over the past few days I’ve been developing some predictive models in R for the solubility data being generated as part of the ONS Solubility Challenge. As I develop the models I put up a brief summary of the results on the wiki. In the end, however, we’d like to use these models to predict the solubility of untested compounds. While anybody can send me a SMILES string and get back a prediction, it’s more useful (and less work for me!) if users can do it themselves. This requires that the models be deployed and made available as a web page or a service. Last year I developed a series of statistical web services based on R. The services were written in Java and are described in this paper. Since I’m working more with REST services these days, I wanted to see how easy it’d be to develop a model deployment system using Python, thus avoiding a multi-tiered system. With the help of rpy2, it turns out that this wasn’t very difficult.
The setup is a mod_python based REST service. Before describing the service, a little bit about the models themselves. The setup requires that you develop a model in R and then save it as a binary R file (via save). Right now you have to save the model in a variable called “model” – a bit restrictive but it might change in the future. You can build any type of model that has overloaded the predict method. Once you have that you need to edit a model manifest file that contains information on the author, description of the model and so on. More importantly, you have to specify the descriptors used in the model. This leads to a limitation – the descriptor calculation step of the service uses the CDK descriptor service and so the models must employ the CDK descriptors. While restrictive it’s not too bad, since the CDK has a wide variety of molecular descriptors. You can get more details about how models are deployed and the format of the manifest from the GitHub repository.
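To make the role of the descriptor list concrete, here is a minimal Python sketch of how a service might use it. Everything in it is an assumption for illustration: the manifest is shown as a plain dictionary with made-up field names and values, and `select_descriptors` is a hypothetical helper. The real manifest format is the one documented in the repository.

```python
from typing import Dict, List

# Hypothetical manifest contents, shown as a dict purely for illustration.
# The field names and values are assumptions, not the real manifest format.
manifest = {
    "author": "A. Modeller",
    "description": "Linear regression model for solubility",
    "descriptors": ["XLogP", "TopoPSA", "nHBDon"],
}


def select_descriptors(manifest: Dict, computed: Dict[str, float]) -> List[float]:
    """Pick the descriptor values the model needs, in manifest order.

    `computed` maps descriptor names to values for one molecule, for
    example as returned by a descriptor calculation service.
    """
    missing = [name for name in manifest["descriptors"] if name not in computed]
    if missing:
        # Fail early rather than hand the model an incomplete input row.
        raise KeyError("descriptors not available: " + ", ".join(missing))
    return [computed[name] for name in manifest["descriptors"]]
```

The point of the sketch is that the manifest, not the model file, tells the service which descriptors to compute and in what order to pass them on, so the two have to stay in sync.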
With the model file and the manifest details it’s pretty easy to set up a simple Python service that uses rpy2 to load the model, calculate descriptors for an input SMILES (Base64 encoded), get a prediction and return it. Thus, to get a list of available models, visit
This gives a plain text page with a list of model identifiers. You can then use a model identifier to get the details of the model (as provided by the author of the model) by appending the identifier. An example would be
Finally, to get a prediction from the above model, simply append a Base64 encoded SMILES string
and you end up with a plain text representation of the predicted value.
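For anyone calling the service from code, the Base64 step might look like the following Python sketch. The base URL and the model identifier are placeholders, not the real service address, and the helper names are hypothetical. I am also assuming standard Base64 here; if the encoded string contains `/` or `+`, a URL-safe variant (`base64.urlsafe_b64encode`) may be needed, depending on what the service expects.

```python
import base64

# Placeholder base URL (an assumption); substitute the real service address.
BASE_URL = "http://example.org/rest/models"


def encode_smiles(smiles: str) -> str:
    """Base64-encode a SMILES string so it can be appended to a URL."""
    return base64.b64encode(smiles.encode("ascii")).decode("ascii")


def decode_smiles(token: str) -> str:
    """Reverse of encode_smiles, as the service side would need to do."""
    return base64.b64decode(token.encode("ascii")).decode("ascii")


def prediction_url(model_id: str, smiles: str) -> str:
    """Build a URL of the form <base>/<model identifier>/<encoded SMILES>."""
    return f"{BASE_URL}/{model_id}/{encode_smiles(smiles)}"
```

For example, `encode_smiles("CCO")` gives `Q0NP`, so a prediction request for ethanol against a model called `model1` would end in `/model1/Q0NP`.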
Admittedly the current version of this service is a quick hack and has a number of restrictions. While any type of model can be deployed, something like a random forest model will require you to list many descriptors in the manifest file manually. In the future, this should probably be automated via an R function. While the manifest for a given model can contain an arbitrarily long description, it’s up to the developer to decide what goes in. Ideally, we’d serialize the model to PMML so that we could easily include details such as coefficients, training and validation statistics and so on. The use of PMML would allow easy inclusion in the manifest. On the other hand it’s relatively easy to extract this information from the model file, so it might simply require the construction of a different URL.
Another drawback is the fact that one gets a single return value. Now, it’s pretty easy to extract, say, confidence limits but this is dependent on the nature of the model. Providing more information in the return value would probably best be handled by generating PMML output.
The current format of the manifest file is pretty crude – ideally I’d use Dublin Core to represent provenance and support more details of the model (such as model type etc), thus avoiding the need to load the model file. Also, there is no schema for the format, which would be a useful addition. Some form of versioning information would also be useful.
One of the biggest performance bottlenecks is that the service deals with one SMILES string at a time. If you’re getting predictions for many molecules, this can become slow (since each prediction loads the model file). Ideally, the service would recognize a POST request and pull one or more SMILES from the fields in the request. This would allow predictions in bulk and make it much faster. Another advantage of using POST would be the ability to provide SDF (or any other multi-line) input.
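A bulk mode along these lines could be sketched in Python as follows. This is not how the current service works; it is a rough illustration under two assumptions: the POST body carries one SMILES per line, and a model can be treated as a callable that maps a SMILES string to a number. The names `parse_smiles_body` and `make_batch_predictor` are hypothetical, and the model loader is passed in so the caching logic stays independent of rpy2.

```python
from functools import lru_cache
from typing import Callable, List

Model = Callable[[str], float]


def parse_smiles_body(body: str) -> List[str]:
    """Split a POST body into SMILES strings, one per non-empty line."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def make_batch_predictor(load_model: Callable[[str], Model]):
    """Wrap a model loader so each model file is loaded once and reused."""
    cached_load = lru_cache(maxsize=None)(load_model)

    def predict_many(model_id: str, smiles_list: List[str]) -> List[float]:
        model = cached_load(model_id)  # loaded only on first use
        return [model(smiles) for smiles in smiles_list]

    return predict_many
```

With something like this, a request for a hundred molecules would load the model file once instead of a hundred times, which is where most of the speedup would come from.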
Model deployment is now simple to achieve and Python is sweet!