Can Gemini Search the ChEMBL Database?

6 minute read

Published:


Introduction
I’ve been an enthusiastic user of the ChEMBL database for more than a decade. ChEMBL contains a treasure trove of information curated from medicinal chemistry journals. When I’m building an initial machine learning model or analyzing the SAR around a particular target, ChEMBL is the first place I go. I’m reasonably good at SQL, but I still spend a lot of time figuring out which data is in which tables and how to join them correctly to get the information I need. For years, I’ve longed for a more straightforward way to query ChEMBL. I recently discovered that the Gemini CLI from Google can write SQL queries for ChEMBL and extract the information I need into a CSV file. Is this my dream come true? Can I stop spending time staring at the ChEMBL schema and just pose a simple natural-language query like this?

> get the smiles,chembl_id, target_name, publication year, article doi,
and IC50 for all kinase inhibitors published after 2022
and write this into a file called kinase_inhibitors_after_2022.csv

This sounds compelling, but does it really work? To put this to the test, I gave the prompt above to Gemini and compared the results with a query I generated by hand. Determining which result was better was tricky and required some detective work. As usual, things weren’t as simple as they initially appeared. Before jumping to the punch line, I should explain a bit about what I did.

Setup

As long as the ChEMBL database file is located in the current directory or below it, Gemini will find it, construct queries, and extract the necessary information. In my case, I was using the SQLite version of ChEMBL, which is great because it doesn’t require me to set up a server, authentication, etc.

To run queries, I only needed two things.

1. The sqlite3 executable, which can be easily installed on any system.
2. The SQLite version of the ChEBML, which can be downloaded from here. Once uncompressed, the file is 28GB, so make sure you have plenty of disk space available.
My directory setup looks like this.

.  
└── chembl_36  
    └── chembl_36_sqlite  
        └── chembl_36.db

Gemini had no trouble finding the ChEMBL database and constructing and running queries.

I found that this works much better with Gemini 3.0. To do this, type “/settings” and set “Preview Features (e.g., models)” to “true”. One caveat is that I found Gemini doesn’t always give the same answer. I ran the same query several times across several systems. In most cases, it gave the “correct” answer (more on this below), but it also made a few mistakes.

  • In one case, Gemini limited the size of the result set. I guess it was trying to save computing resources.
  • In another case, it returned the ChEMBL DOI rather than the article DOI.
  • In a third case, it used a more restrictive version of “kinase” and returned only a few results.

In all three cases above, Gemini returned the correct result after an additional prompt. We’re not there yet, but we’re getting closer.

Does Gemini Return the Correct Answer?

Gemini can construct and run the query, but does it return the correct results? To test, I built my own workflow and ran it as a comparison. I’m starting to feel like a contestant on the old game show “Are You Smarter Than a 5th Grader,” except in this case it’s “Am I Smarter Than an LLM” (probably not). My approach to this was pretty simple.

1. Use the bioservices API to query the UniProt database for UniProt IDs corresponding to all human protein kinases.
2. Convert the UniProt IDs to ChEMBL IDs
3. Use an SQL query to extract compound structures and IC50s for all the targets specified in #2.

For those interested in the gory details, a Jupyter notebook with code I used to extract and analyze the data is available on GitHub. My search returned 44,992 records, while the Gemini search returned 38,571. Why the difference?

The approach Gemini took was similar to mine. The primary difference was how we defined which proteins were kinases. I took the easy route and simply asked UniProt for a list of human kinases. Gemini took a different, and I must admit, clever approach and used ChEMBL’s protein classification hierarchy. The ChEMBL database contains a table called “protein classification,” which encodes a hierarchy of protein classes. At the top level are fundamental protein classes like “Enzyme,” “Adhesion,” and “Secreted protein.” Below “Enzyme” are classes like “Kinase”, “Transferase”, “Oxoreductase”, etc. A small subset of the protein classification hierarchy is shown in the figure below. If we start with a target at the bottom of the hierarchy and work our way up from a kinase target, we should eventually reach the category “Kinase”. If we do, our target is a kinase. This sounds like a reasonable approach, so what went wrong?

It’s Not You, It’s Me

As I mentioned above, my query returned more results than Gemini’s. Let’s look at what happened. First, let’s look at the records Gemini found that my query didn’t. There were 35 records across 4 targets, shown in the table below. Technically, none of these are kinases. The first three are protein kinase regulatory subunits, which lack catalytic activity. The final protein is a pseudokinase. However, all four are classified as kinases in the ChEBML protein classification hierarchy. The query constructed by Gemini did what it was supposed to do. The issue was with how the data was curated.

Next, let’s look at the records my query found that the Gemini query missed. In this case, the difference was much larger. My query found an additional 6,456 records covering 43 proteins. The top-level protein classifications for these proteins are shown in the table below. We can see that 30 are classified as Transferase, which is at the same level in the hierarchy as “Kinase”. An additional 11 proteins were classified as “Enzyme”, not as “Kinase”. One protein, Q96QT4 (TRPM7), is classified as “Transient receptor potential channel” but is also a kinase. More specifically, it is a bifunctional protein that uniquely combines an ion channel with an intrinsic serine/threonine protein kinase domain.

This harkens back to a theme I’ve been harping on for several years. It’s all about the data. Gemini constructed a valid query, but that query is only as good as the data it searches. Currently, the quality of that data depends on the people curating it. If the curators are inconsistent, an LLM will return flawed results.