Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching and to achieve that we’ve been employing the Ballester and Graham-Richards method. In this post I’m going to talk about performance – how we went from a single monolithic database with long query times, to multiple databases and significantly faster multi-threaded queries.
Indexing Blues: The method allows us to represent each molecules as a 12-D vector. We can then identify molecules similar in shape to a query by identifying the nearest neighbors (within a radius R) to the query molecule in 12-D space. To do this fast, we employ an R-tree index, which allows us to perform such nearest neighbor queries very efficiently. However, we face a problem. The goal of an index is to allow us to avoid scanning a whole table. Ideally, an index will be a compact representation of a table column and in general one expects that an index is stored in RAM. For the case of Pub3D, the size of the R-tree index is approximately 5GB and we cannot store it in RAM. As a result, simply reading the index took a significant amount of time and even by increasing the amount of RAM available to the Postgres server (shared_buffers) didn’t improve things a whole lot. Furthermore, if we wanted to include multiple conformers, the table size could expand by a factor of 10 at the minimum. So the initial approach of a single database was untenable.
Solution: The simple solution was to partition the table into six separate tables, and place them on six different machines (using separate disks rather than something like NFS). This leads to multiple benefits. First, the R-tree index is much smaller for each individual database and all of it (or a large fraction) can be stored in RAM. As a result, queries are significantly faster. Second, each database can be queried independently – since each one uses its own disk, the queries can be truly asynchronous.
The next step is to query these databases. The front end to Pub3D is a PHP page that retrieves results via a REST-like interface (example) to the actual databases. The interface is implemented in Python, using mod_python and psycopg2. Given that we have six databases that can be queried simultaneously, it’s natural that the Python code should be multi-threaded. But before describing that, let’s consider the performance of serial code. That is, query each database one by one. We use a simple query
select count(*) from pubchem_3d
The code to perform the queries would look something like this
for host, port in condetails: con = _getConnection(host, port) if not con: raise Exception cursor = con.cursor() cursor.execute(query) for row in cursor.fetchall(): allRows.append(row)
The Python code connects to the databases remotely and simply retrieves the result of the query and places it into a global list. The time taken for the above code is approximately 17.4s (the database servers run Postgres 8.2 and have 4GB RAM, with 2.5GB allocated to shared_buffers)
But 17.4s is not fast enough. So we next consider how we can query the databases using multiple threads. To do this we create a class derived from threading.Thread, whose job is to query a single database.
class DBThread(threading.Thread): def __init__(self, host, port, query): self.host = host self.port = port self.query = query threading.Thread.__init__(self)
def run(self): global allRows con = _getConnection(self.host, self.port) if not con: return cursor = con.cursor() cursor.execute(self.query) for row in cursor.fetchall(): allRows.append(row) con.close() return
The class is initialized with the host name and port number for the remote database, along with the query string. When each instance of the class is started, it performs the query and puts the rows of the result into a global list. So we can easily start up six threads by writing, and wait for them to all finish before proceeding.
query = "select count(*) from pubchem_3d" for host, port in condetails: dbt = DBThread(host, port, query) threadList.append(dbt) dbt.start() while threading.activeCount() > 1: time.sleep(1)
When the above code is timed, we get all the results back in 4s – much nicer! Now, this is quite a simple problem as there are no deadlock issues and other monsters of multi-threaded programming to address.
Caveats: The code above is certainly not optimal. First, since there is overhead to starting a thread, so one should be using a thread pool. In this case, the number of threads is always fixed, so I didn’t bother. Second, and probably most importantly, the threading module does not really provide true threads. There has been much discussion surrounding the Global Interpreter Lock (GIL), and I’m not enough of an expert to comment on this. However the problem with the threading module can be seen by the fact that they all run on the same core of a multi-core CPU. Supposedly, Python3000 aims to solve this issue and provides true threading support. An alternative is to consider the multiprocessing module, in 2.6 and above. This module uses subprocess rather than threads and so provides support for concurrency that avoids the GIL. As with the threading module, it’s also quite easy to use.
But the conclusion is that for data parallel type tasks, the Python threading module allows us to easily write code to achieve nice speedups.