Why Don’t Machine Learning Models Extrapolate?
Introduction
One thing newcomers to machine learning (ML), and many experienced practitioners, often don’t realize is that most ML models don’t extrapolate. After training a model on compounds with µM potency, people frequently ask why none of the molecules they designed were predicted to have nM potency. (If you’re new to drug discovery, 1 nM = 0.001 µM, and a lower potency value is usually better.) It’s important to remember that most commonly used models can only predict values within the range of the training set: if we’ve trained a model on compounds with IC50s between 5 and 100 µM, the model won’t be able to predict an IC50 of 0.1 µM. I’d like to illustrate this with a simple example. As always, all the code that accompanies this post is available on GitHub.
A Simple Experiment
Let’s build one of the simplest models imaginable: a model that predicts a molecule’s molecular weight (MW) from its chemical structure. The models will be trained on molecules with molecular weights under 400. After training, we will evaluate each model’s performance on two test sets. First, we will assess performance on another set of molecules in a similar MW range, which we will call TEST_LT_400. Next, we will run a more challenging test on molecules with molecular weights between 500 and 800, referred to as TEST_GT_500. The box plot below compares the molecular weight distributions of the training set and the two test sets.
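To make the setup concrete, here is a minimal sketch of how the sets could be constructed. It assumes a pandas DataFrame df of ChEMBL structures with a SMILES column; the column names and random seed are illustrative, not the exact code from the GitHub repo.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# df is assumed to be a DataFrame of ChEMBL structures with a "SMILES" column
df["mol"] = df["SMILES"].apply(Chem.MolFromSmiles)
df = df.dropna(subset=["mol"])          # drop any SMILES the RDKit can't parse
df["MW"] = df["mol"].apply(Descriptors.MolWt)

# Training set and first test set: molecules with MW < 400
lt_400 = df.query("MW < 400").sample(n=1000, random_state=42)
train = lt_400.iloc[:750]               # 750 molecules for training
test_lt_400 = lt_400.iloc[750:]         # 250 molecules, similar MW range

# Second, harder test set: molecules with MW between 500 and 800
test_gt_500 = df.query("500 < MW < 800").sample(n=250, random_state=42)
```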
We will build the model with two distinct methods to avoid bias from any single model architecture. First, we will use Morgan count fingerprints calculated with the RDKit as descriptors and construct a model with LightGBM, a gradient-boosting method that builds an ensemble of decision trees. In parallel, we will develop a model with ChemProp, which uses a message-passing neural network (MPNN) to learn a molecular representation and a feed-forward neural network (FFNN) for training and inference. As a control, we will also include linear regression on the same fingerprints used by LightGBM. Since linear regression simply applies a fixed set of coefficients to the input variables, its predictions are not bounded by the training set, and it should be able to predict values outside that range.
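A sketch of the fingerprint-based models might look like the following; it continues from the split above. The fingerprint parameters (radius 2, 2048 bits) are typical defaults rather than values taken from the original code, and the ChemProp model, which has its own training pipeline, is omitted here.

```python
import numpy as np
from rdkit.Chem import rdFingerprintGenerator
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression

# Morgan count fingerprints as descriptors (radius and size are assumptions)
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def featurize(mols):
    # Each molecule becomes a vector of substructure counts
    return np.array([fpgen.GetCountFingerprintAsNumPy(m) for m in mols])

X_train = featurize(train["mol"])
y_train = train["MW"].values

lgbm = LGBMRegressor(random_state=42).fit(X_train, y_train)
lin_reg = LinearRegression().fit(X_train, y_train)
```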
Testing on Similar Data Distributions
All models were trained on the same training set: 750 randomly selected molecules from the ChEMBL database with molecular weights under 400. When we tested on the 250 molecules in TEST_LT_400, we observed reasonably good performance: every model achieved a Pearson r greater than 0.70. This makes sense, since the molecular weights of the training and test sets have similar distributions.
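Continuing the sketch, evaluating on TEST_LT_400 is just a matter of featurizing the held-out molecules and computing Pearson r (SciPy’s pearsonr is used here):

```python
from scipy.stats import pearsonr

X_test = featurize(test_lt_400["mol"])
y_test = test_lt_400["MW"].values

for name, model in [("LightGBM", lgbm), ("Linear regression", lin_reg)]:
    r, _ = pearsonr(y_test, model.predict(X_test))
    print(f"{name}: Pearson r = {r:.2f}")  # both should land above 0.70
```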
Testing on Dissimilar Distributions
However, the results are far less promising when we use models trained on molecules with molecular weights below 400 to predict the molecular weights of the 250 molecules in TEST_GT_500. First, let’s examine the predicted molecular weights for TEST_GT_500. As the histograms below show, LightGBM and ChemProp predict values only within the range present in the training set; the few predictions slightly above 400 can be attributed to model variability. Note that linear regression does predict values outside the training range, although this does not necessarily mean the model is extrapolating accurately (see below).
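Histograms like these can be reproduced along the following lines; the dashed line marks 400, the upper edge of the training MW range:

```python
import matplotlib.pyplot as plt

X_gt_500 = featurize(test_gt_500["mol"])

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
for ax, (name, model) in zip(axes, [("LightGBM", lgbm), ("Linear regression", lin_reg)]):
    ax.hist(model.predict(X_gt_500), bins=30)
    ax.axvline(400, linestyle="--", color="red")  # top of the training range
    ax.set_title(name)
    ax.set_xlabel("Predicted MW")
fig.tight_layout()
plt.show()
```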
As expected from the distributions of the predicted values for TEST_GT_500, both LightGBM and ChemProp perform poorly; in these cases, there is no correlation between the actual and predicted molecular weights. This is easy to rationalize for a tree-based model: each tree’s prediction is an average of training-set target values in a leaf, so the output can never fall outside the range of the training data. Linear regression, by contrast, extrapolates effectively into the higher molecular weight range of TEST_GT_500. Does this imply we should disregard more sophisticated ML methods and rely solely on linear regression? Probably not. Linear regression cannot capture the non-linear relationships that modern ML methods can. That said, it’s always worth starting with simpler methods; you may sometimes be pleasantly surprised by the outcomes.
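A one-dimensional toy example shows the same behavior without any chemistry. A tree ensemble trained on y = 2x for x in [0, 10] predicts roughly the maximum training target when asked about x = 20, while linear regression follows the line out of the training range:

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression

# Train both models on a simple linear relationship, y = 2x, for x in [0, 10]
x_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * x_train.ravel()

tree_model = LGBMRegressor(random_state=42).fit(x_train, y_train)
lin_model = LinearRegression().fit(x_train, y_train)

# Predict far outside the training range
x_new = np.array([[20.0]])
print(tree_model.predict(x_new))  # ~20: capped near the largest training target
print(lin_model.predict(x_new))   # ~40: the fitted line keeps going
```

This is exactly the effect we saw with the MW models: the trees clip at the edge of the training range, while the line keeps going (though it extrapolates correctly only because the underlying relationship really is linear).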
Conclusion
This post might seem obvious, but I frequently encounter these misconceptions when discussing ML with people from chemistry or biology backgrounds. It’s crucial to recognize that our models are restricted by the range of values present in the training set. For example, if we build a model to predict yields from reaction conditions, and all the reactions in our training set have yields below 50%, the model won’t predict any conditions that achieve a 90% yield. This doesn’t mean that no conditions capable of producing a 90% yield exist; it simply reflects the fact that the model has only seen yields below 50%. Likewise, if we only train a model on compounds with poor pharmacokinetics (PK), it’s unlikely that the model will predict molecules with good PK. To use ML models effectively in drug discovery, we must understand their capabilities and limitations. By setting realistic expectations, ML can be a valuable tool for enhancing drug discovery projects.
Addendum
This post generated great discussions about why some methods don’t extrapolate. Jan Jensen’s post provides valuable mathematical follow-up and suggests intriguing future directions.