Just Because You Published It Doesn’t Mean It’s Right

Introduction
When I read or review papers on machine learning in drug discovery, I immediately look for a few key points.
1. Did the authors use high-quality datasets?
2. When comparing methods, did they perform appropriate statistical analyses?
3. Was the similarity between the training and test sets reported?

Unfortunately, the answer to all of these questions is often no. An example of this concerning trend is provided in a recent paper from JCIM, which compares several approaches to modeling aqueous solubility. In it, the authors collected data from four different studies, removed duplicates, harmonized the solubility values, and created a dataset with 17,937 chemical structures (actually 17,935, as two structures had errors) with corresponding aqueous solubility values expressed as the log of the molar solubility (LogS). This data was used to train various machine learning models, which were then tested on a solubility dataset originally published by Huuskonen.

To satisfy my curiosity, I downloaded their data and built an ML model using XGBoost with 2D descriptors calculated with the RDKit. The performance of my model, with an R2 of 0.92 and MAE of 0.41, was comparable to the best R2 of 0.92 and MAE of 0.40 reported in the paper.
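
For anyone who wants to try something similar, here is a minimal sketch of how such a model might be put together. This is not the exact code I used; the file names (train.csv, huuskonen.csv), the column names (SMILES, LogS), and the XGBoost hyperparameters are all assumptions chosen for illustration.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error


def descriptor_frame(smiles_list):
    """Calculate all RDKit 2D descriptors for a list of SMILES strings."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi}")
        rows.append({name: fn(mol) for name, fn in Descriptors.descList})
    return pd.DataFrame(rows)


# Hypothetical file and column names -- substitute the actual data.
train = pd.read_csv("train.csv")
test = pd.read_csv("huuskonen.csv")

X_train = descriptor_frame(train.SMILES)
X_test = descriptor_frame(test.SMILES)

model = XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, train.LogS)
pred = model.predict(X_test)
print(f"R2  = {r2_score(test.LogS, pred):.2f}")
print(f"MAE = {mean_absolute_error(test.LogS, pred):.2f}")
```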

On the surface, this appears to be a great model. Unfortunately, the performance on the Huuskonen dataset doesn’t reflect what can be expected with “real-world” datasets. There are two factors that make the Huuskonen dataset too easy:
1. The dynamic range is unrealistically large. As I’ve written about before, the dynamic range of a test set should reflect real-world use cases. In drug discovery, aqueous solubility datasets typically span 2 to 3 logs. The Huuskonen dataset spans a whopping 12 logs. When the dynamic range is this large, regression becomes much easier, and R2 values are inflated.
2. While the authors removed structures that were duplicated between the training and test sets, most of the Huuskonen test set compounds have similar neighbors in the training set.

Additionally, the Huuskonen dataset, along with the datasets used to train the model, was assembled from dozens of papers. As I’ve mentioned before, using data like this to build an ML model is a poor idea. Since the experiments were not performed under consistent conditions, it doesn’t make sense to use the data for training or testing an ML model. In a related example, a 2024 paper by Landrum and Riniker showed a significant lack of consistency in IC50 values for the same compounds tested in the “same” assay across different papers. If anything, the situation with solubility values is probably even worse.

Comparing Training and Test Sets

To provide a more realistic view of model performance, I selected two solubility datasets from the Polaris collection. Unlike the datasets used to train the model, the experiments that generated these values were carried out in a consistent manner. The datasets I chose are quite different in nature; the Biogen dataset includes a diverse set of compounds from a screening collection, while the Antiviral dataset was obtained from a lead optimization campaign that was part of the Covid Moonshot effort.

Before building an ML model, it’s always a good idea to perform some exploratory data analysis. We begin by comparing the range of values in the training set with those in the test sets. One effective way to do this is by creating a box plot. The plot below shows the distribution of LogS for the training and test sets. The boxplots reveal two main points. First, the distributions of LogS for the training set and the Huuskonen test set are similar. Second, the dynamic range for the Antiviral and Biogen sets is considerably smaller than that of the other two sets.
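
A sketch of how a box plot like this might be generated with pandas and seaborn is shown below. The CSV file names and the LogS column name are assumptions carried over from the earlier sketch.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file names for the four sets; each CSV is assumed to
# contain a "LogS" column.
files = {"Training": "train.csv", "Huuskonen": "huuskonen.csv",
         "Antiviral": "antiviral.csv", "Biogen": "biogen.csv"}

# Stack the four sets into one long DataFrame for plotting.
plot_df = pd.concat(
    pd.read_csv(path).assign(Dataset=name)[["Dataset", "LogS"]]
    for name, path in files.items()
)

sns.boxplot(data=plot_df, x="Dataset", y="LogS")
plt.ylabel("LogS")
plt.show()
```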

This becomes even clearer when looking at the interquartile range (IQR) for the three test sets. The IQR represents the height of the box in the box plots. The IQR for the training and Huuskonen sets exceeds 2.5 logs, while the other two sets have an IQR below one log. This smaller dynamic range makes it more difficult to achieve a good R2 value from a regression model.

Dataset      IQR
Training     2.75
Huuskonen    2.59
Antiviral    0.83
Biogen       0.55
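
These IQR values can be computed directly from the combined DataFrame assembled for the box plot above, roughly as follows.

```python
# IQR (75th minus 25th percentile) of LogS for each set, using the
# combined "plot_df" DataFrame from the box plot sketch above.
iqr = plot_df.groupby("Dataset")["LogS"].apply(
    lambda s: s.quantile(0.75) - s.quantile(0.25)
)
print(iqr.round(2))
```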

Another factor to consider in exploratory data analysis is how similar the test set(s) are to the training set. We can measure this by calculating the Morgan fingerprint Tanimoto similarity between the training and test sets and recording the highest similarity for each test set compound relative to the training set. By analyzing the Tanimoto similarity distribution, we can assess the overall difficulty of evaluating the model. In the figure below, we display the distributions of maximum Tanimoto similarity for each of the three test sets. For the Huuskonen dataset, at least half of the compounds have a Tanimoto similarity greater than 0.5 to the training set. If we set a heuristic cutoff for Tanimoto similarity at 0.35, then most compounds in the Huuskonen data can be considered similar to the training set. We can also see that the Antiviral and Biogen sets are much less similar to the training set.
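
The sketch below shows one way to carry out this nearest-neighbor analysis with the RDKit, using Morgan fingerprints (radius 2, 2048 bits) and BulkTanimotoSimilarity. The file and column names are again assumptions, and the code assumes every SMILES parses cleanly.

```python
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Morgan fingerprint generator (radius 2, 2048 bits).
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def fingerprints(smiles_list):
    """Fingerprints for a list of SMILES; assumes every SMILES parses."""
    return [fpgen.GetFingerprint(Chem.MolFromSmiles(smi)) for smi in smiles_list]


# Hypothetical file names, as in the earlier sketches.
train_fps = fingerprints(pd.read_csv("train.csv").SMILES)
huuskonen = pd.read_csv("huuskonen.csv")

# For each test compound, record the highest Tanimoto similarity to any
# training set compound.
huuskonen["max_sim"] = [
    max(DataStructs.BulkTanimotoSimilarity(fp, train_fps))
    for fp in fingerprints(huuskonen.SMILES)
]

# Fraction of the test set "similar" to the training set at a 0.35 cutoff.
print((huuskonen.max_sim > 0.35).mean())
```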

Performance on Realistic Test Sets

As mentioned earlier, a relatively simple ML model using RDKit descriptors and XGBoost achieved what appeared to be impressive performance. Now, let’s see what happens when this same model is applied to predict the aqueous solubility of datasets that differ from the training set and cover a narrower, more realistic dynamic range. To evaluate the model’s performance, I generated the same RDKit descriptors for the Antiviral and Biogen datasets. The XGBoost model, trained on the same training set as before, was then used to predict LogS. The plot below compares the experimental and predicted values for each of the three test sets. As shown, there is no apparent correlation between the experimental and predicted values for the Antiviral and Biogen test sets. While an MAE corresponding to roughly a 10-fold error may seem somewhat useful, a simple null model that predicts each value as the dataset mean has an MAE of approximately 0.53.
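
A minimal sketch of this evaluation, reusing the model and descriptor function from the first code block and comparing against a mean-predicting null model, might look like this. The file names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Reuses "model" and "descriptor_frame" from the first sketch; the
# file names below are assumptions.
for name in ["antiviral", "biogen"]:
    df = pd.read_csv(f"{name}.csv")
    pred = model.predict(descriptor_frame(df.SMILES))
    null_pred = np.full(len(df), df.LogS.mean())  # null model: predict the mean
    print(f"{name}: R2={r2_score(df.LogS, pred):.2f} "
          f"MAE={mean_absolute_error(df.LogS, pred):.2f} "
          f"null MAE={mean_absolute_error(df.LogS, null_pred):.2f}")
```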

How Can We Improve the Situation?

Papers like the one in question aren’t helping to advance the field. In fact, they’re doing the opposite. They lead readers to believe that solubility prediction is a solved problem, when it isn’t. Journals should establish guidelines that specify the quality of datasets. It’s 2025, and there’s no reason for anyone to rely on low-quality datasets like those in MoleculeNet or the Therapeutic Data Commons. Unfortunately, we’ve fallen into a cycle where people assume that because a dataset was used in a paper published in a high-impact journal, it must be a “gold standard.”

I believe there are a few things we can do to improve the situation. As authors, we can make sure our papers use high-quality datasets and our benchmarks include proper statistical comparisons. As reviewers, we have the chance to ensure that the papers we review follow best practices. I’ve reached a point where at least half of my reviews contain language similar to the following.

Two issues must be addressed before this paper is reviewed.

1. The authors used the MoleculeNet datasets, which are significantly flawed and should not be used. For more details, see this blog post:
https://practicalcheminformatics.blogspot.com/2023/08/we-need-better-benchmarks-for-machine.html.
Machine learning papers should use high-quality datasets like those provided by Polaris:
https://polarishub.io/datasets?certifiedOnly=true
2. The reported benchmarks do not include a statistical comparison of the methods. Tables X and Y are not enough. For guidance on proper statistical comparisons in machine learning, please refer to this paper:
https://pubs.acs.org/doi/full/10.1021/acs.jcim.5c01609.

Oops, I guess my reviews are no longer anonymous.

Finally, journal editors and editorial advisory boards can establish guidelines that incorporate best practices. Journals should endorse datasets like those provided by the Polaris initiative, which involves groups of experts from industry and academia certifying the quality and relevance of benchmark datasets. I hope to reach a point where reviewers of ML papers have a checklist to ensure that high-quality datasets are used and that appropriate statistical tests are applied when comparing methods. In 2011, I was part of a team that developed guidelines for computational papers in the Journal of Medicinal Chemistry. These guidelines significantly improved the quality of papers in the journal. I am convinced we can achieve the same with papers focused on machine learning in drug discovery.

Code and Data
The code and data used to generate this analysis can be found on GitHub.