The Trouble With Tautomers
Published:
 Introduction
 One factor often overlooked when applying machine learning (ML) in small-molecule drug discovery is the influence of tautomers on model predictions. Drug-like molecules, especially those containing heterocycles and conjugated pi systems, can exist in several different tautomeric forms. These forms feature varying bond orders between the atoms. Consequently, the molecular representation used in an ML model varies. This remains true regardless of whether we’re using molecular fingerprints, topological descriptors, or message passing neural networks (MPNN).
I won’t go into a detailed explanation of tautomers here. Those interested in the topic are urged to read the excellent reviews on the subject by Roger Sayle and Yvonne Martin. This post from Jeremy Monat also provides a good introduction to dealing with tautomers in the RDKit. After a LinkedIn exchange about ML and tautomers, I began to consider a few fundamental questions that I don’t recall seeing addressed in the literature.
- How prevalent are tautomers in drug-like molecules?
- How significantly do tautomers impact predictions made by ML models?
- Are there methods to reduce inconsistencies in predictions generated for different tautomers of the same molecule?
To address these questions, I used the aqueous solubility data from a 2023 paper by Cheng Fang and coworkers at Biogen. In the remainder of this post, I will refer to this as the Biogen Solubility Dataset (BSD). I chose this set for several reasons.
- It’s reasonably large, containing 2,173 compounds.
- The molecules are drug-like and commercially available screening compounds.
- The data covers a reasonable dynamic range, with 3.2 logs between the highest and lowest values.
How Pervasive are Tautomers in Drug-like Molecules? 
 To address the first question above, I used the rdMolStandardize class in the RDKit to generate a set of tautomers for each molecule in the BSD. It should be noted that tautomer generators like the one in the RDKit use algorithmic rules to produce tautomers. Many of the generated tautomers do not represent physically stable species. The histogram below shows the distribution of the number of tautomers calculated for each of the 2173 molecules. After the tautomer enumeration, we can see that 25% of molecules have only one tautomer. The remainder had between two and 292 tautomers, with a median of two. Note that the RDKit stops tautomer enumeration at 292, so the true maximum may have been larger. The bottom line is that most drug-like molecules will have more than one calculated tautomer. Let’s see how this can impact an ML model. 
 
How Do Tautomers Impact ML Model Predictions?
 Next, I wanted to investigate the impact of tautomers on ML model predictions. To examine this phenomenon, I began with a set of tautomers generated for each molecule in the BSD. I then calculated the pairwise Tanimoto similarities between pairs of tautomers for the same molecule and selected the least similar tautomer pair. The histogram below shows the distribution of Tanimoto similarities for the least similar tautomer pairs. We see a large peak at 1 in the distribution, representing molecules with only one tautomer. However, in 30% of cases, we encounter tautomer pairs where the Tanimoto similarity between the pair is less than the widely accepted similarity cutoff of 0.35.
The figure below shows pairs of structures where two tautomers of the same molecule exhibit low similarity. The value below each tautomer pair indicates the Tanimoto similarity for that pair. An examination of the structures reveals that in most cases, at least one of the tautomers is unrealistic and does not accurately represent the physical state of the molecule. Keep in mind that tautomer generation algorithms operate solely on the molecular graph and do not reflect the underlying physics. Consequently, the tautomer pairs generated here should be considered a worst-case scenario.
How Does Training Set Composition Impact Prediction Variability? 
 Before we investigate the impact of tautomers on ML model variability, let’s establish a baseline. The composition of the training data will affect the predictions generated by an ML model. To establish a baseline, we will perform 10 train/test splits and record the test set predictions. Using the standard train_test_split procedure in scikit-learn, the data is divided so that 75% is allocated for training and 25% for testing. Given this split, each molecule will appear in the test set an average of 2.5 times. The histogram below illustrates the number of times each molecule appeared in the test set over 10 folds of cross-validation.
After recording the predictions, we calculate the prediction ∆ as the difference between the largest and smallest predicted values for each molecule. The histogram below shows the distribution of the prediction ∆ across 10 folds of cross-validation. We can see that the ∆ is typically very small, with a mean of 0.2 logs, which is less than the 0.3 logs usually attributed to experimental error for aqueous solubility measurements.
How Do Tautomers Impact ML Model Predictions?
1. Train on the input representation and test on tautomers
 Now that we’ve established a baseline, we can examine how using different tautomers to train and test our models impacts the values produced by the models. To begin, we will train our models on the default representation used in the input file and perform two predictions, one on each of the dissimilar tautomers. As in the baseline study, we will conduct 10 folds of cross-validation and collect the test set predictions. Additionally, we will calculate ∆, the largest difference in predictions for the two tautomers. This procedure is outlined in the figure below.
- We start with the BSD, consisting of molecules with measured solubility (logS).
- Generate the least similar tautomer pairs for each molecule in the dataset.
- Calculate Morgan fingerprints for the input SMILES and each tautomer.
- Train a LightGBM model on the input SMILES fingerprints.
- Use the model from step 4 to predict the solubility of each tautomer based on its fingerprint.
- Calculate the difference (∆) between the predictions for the two tautomers.
- Repeat the above steps for 10 iterations to assess the variability in predictions.
The boxplot below illustrates the distribution of prediction ∆ between test set tautomer pairs across 10 folds of cross-validation. As shown below, while a few cases have a prediction ∆ exceeding one log, most differences remain small. The mean ∆ is 0.22, and the median is 0.17.
2. Train on the tautomers and test on the input representation 
 In this section, we will invert the procedure we used above. We will train two models, one for each member of the tautomer pair, and then use these models to make predictions based on the input representation. The prediction ∆ will be calculated as the difference between the predictions generated by the two models. The procedure for this section is outlined in the diagram below.
- Generate fingerprints for the input SMILES and each tautomer (as done previously).
- Create two machine learning models, one for each set of tautomers. Train the first model on tautomer set 1 and the second on tautomer set 2.
- Use the models to predict the solubility based on the fingerprints generated from the input SMILES.
- Calculate the difference (delta) between the predictions for the two tautomers.
- Repeat this process for 10 iterations to evaluate the variability in predictions.
The box plot below illustrates the prediction ∆ as a function of cross-validation fold. We observe that, similar to what we encountered when predicting on different tautomers, the average ∆ is less than the typical experimental error. Although there are a few prediction ∆ values between 0.5 and 1.0 logs, the mean ∆ is 0.22 and the median is 0.16.
Comparing Tautomer Predictions With the Baseline
 As we’ve seen, there doesn’t seem to be a significant difference between the variability observed across multiple folds of cross-validation and the variability arising from different tautomer representations. Let’s place this variability side by side for comparison. First, we will aggregate the prediction ∆s for the three methods we compared: 
 1. Baseline - the variability across 10 folds of cross-validation
 2. Train input / test tautomer - train the model on the input representation, test on the two most different tautomers, and calculate the prediction ∆ as the difference
 3. Train tautomer / test input - train two models, one on each of the tautomer sets, and test on the input representation, calculating the prediction ∆ as the difference.
As shown below, although the distributions for the tautomer methods have slightly more outliers, their means remain comparable.
Of course, we don’t want to simply guess that the distributions are similar. We can use a statistical test to compare the distributions. In this case, we will use Tukey’s Honest Significant Difference (HSD) test, which includes a correction for multiple comparisons. We can utilize the tukey.plot_simultaneous method from statsmodels to do this. I discussed this plot in detail in this blog post. Briefly, Tukey’s Honest Significant Difference (HSD) plot is used to determine the 95% confidence intervals, corrected for multiple comparisons, around the mean values for each distribution. These are plotted; the reference distribution is colored blue, and dashed lines are drawn at the confidence intervals. The other distributions are colored red to indicate a statistically significant difference from the reference, or grey to indicate a lack of difference.
As we can see below, both tautomer approaches differ significantly from the baseline but do not differ from each other. That being said, all the differences are small, with mean values between 0.21 and 0.23 logs. Given that experimental errors for aqueous solubility are typically around 0.3 logs, the differences we observe due to tautomer representation probably aren’t something we need to lose sleep over.
Conclusions
 I began this post with a few questions that I’d like to revisit.
- How pervasive are tautomers in drug-like molecules?
Based on one representative dataset, tautomers are very common. 75% of molecules had more than one calculated tautomer. Just for fun, I repeated the same experiment with 1.98 million molecules with a molecular weight of less than 500 from the ChEMBL database and reached a similar conclusion. 74% of the ChEMBL molecules had more than one tautomer.
The following two questions are somewhat related.
- How much of an impact do tautomers have on predictions generated by ML models?
- Are there ways to mitigate inconsistencies in predictions generated for different tautomers of the same molecule?
If the answer to the first is “not much,” then the answer to the second is “do nothing. “ Based on the dataset we investigated, tautomers don’t have a large influence on ML model predictions. While there were a few cases where the prediction ∆ was more than a log, the average ∆ was around 0.2 logs, which is less than experimental error.
I’ve talked with people who believe it is best practice to standardize tautomers before performing model training or inference. I think this approach may be counterproductive for a couple of reasons. 
 1. Canonical tautomers are typically algorithmic and don’t represent experimentally observed species. 
 2. When generating canonical tautomers, two similar molecules can sometimes generate very different tautomers.
My recommendation is to train models and perform inference using the input representation. I do not recommend standardizing tautomers before model training or inference. 
 1. As previously discussed, most tautomerization algorithms are designed to produce a standard representation of a molecular graph. They do not necessarily generate physically relevant tautomers. 
 2. The representations of molecules in databases and compound catalogs are typically drawn by chemists who know what they are doing. We often obtain reasonable tautomers based on the intuition of the individual who drew the molecule. 
 3. As we observed above, the variation in model predictions due to tautomers is usually quite small.
However, as Srijit Seal pointed out in a LinkedIn comment on this post, it is important to generate a canonical tautomeric representation to ensure that a dataset does not contain duplicate structures. I typically achieve this by generating an InChI key and verifying that the number of unique InChI keys matches the number of rows in the table. That said, I still would not use a canonical tautomer representation to build an ML model.
So, it turns out that tautomers tend to have minimal impact on ML predictions. While there are a few cases where large differences occur, these seem to be exceptions rather than the rule. Perhaps I should have titled this post “How I Learned to Stop Worrying and Love Tautomers”. The code I used for this analysis is available on GitHub; please check it out if you’re interested. As always, I’d love to hear what others think. How do you deal with tautomers? Please leave a comment.
Acknowledgement
 I’d like to thank Greg Landrum for valuable discussions and comments on earlier versions of the post.
