GNNs can extrapolate for some properties, but there’s a trick

This guest post was written by Jeffery Zhou and Alan Cheng, and is a follow-up to “Why Don’t Machine Learning Models Extrapolate?”.

One of the great things about Practical Cheminformatics is the discussion it spurs in the community. We’ve long been interested in the practical aspects of model extrapolation for small molecule and biologics properties, and we were intrigued by Pat’s findings.

Some GNN background. Graph neural networks (GNNs) such as Chemprop often show significantly improved predictions when datasets are sufficiently large. They work by generating learned representations (embeddings) for each atom and bond that depend on the neighboring atoms and bonds. Relevant to our discussion here is how these embeddings are aggregated into a single embedding for the whole molecule (think of it as creating a fingerprint representation). They can be aggregated by simply summing the embeddings (‘sum aggregation’) or by averaging them (‘mean aggregation’). In practice, rather than using the raw sum, the summed embedding is divided by a constant to keep the values small; this is called ‘norm aggregation’. Our group has empirically found that norm aggregation performs better for predicting small molecule activities and properties, and this was also reported in the recent Chemprop paper.
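To make the three schemes concrete, here is a minimal PyTorch sketch (not Chemprop’s actual implementation; the `aggregate` helper is illustrative, and the norm constant of 100 mirrors Chemprop’s default):

```python
import torch

def aggregate(atom_embeddings: torch.Tensor, mode: str = "norm",
              norm_const: float = 100.0) -> torch.Tensor:
    """Collapse per-atom embeddings (n_atoms x hidden) into a single
    molecular embedding of shape (hidden,)."""
    if mode == "sum":
        # Simple sum: the magnitude grows with the number of atoms.
        return atom_embeddings.sum(dim=0)
    if mode == "mean":
        # Average: molecule-size information is divided away.
        return atom_embeddings.mean(dim=0)
    if mode == "norm":
        # Sum divided by a fixed constant: values stay small, but the
        # embedding still scales with molecule size, unlike the mean.
        return atom_embeddings.sum(dim=0) / norm_const
    raise ValueError(f"unknown aggregation mode: {mode}")

# Toy example: a 12-atom "molecule" with hidden size 4, all-ones embeddings.
emb = torch.ones(12, 4)
print(aggregate(emb, "sum"))   # tensor([12., 12., 12., 12.])
print(aggregate(emb, "mean"))  # tensor([1., 1., 1., 1.])
print(aggregate(emb, "norm"))  # tensor([0.1200, 0.1200, 0.1200, 0.1200])
```

Note from the toy output that the mean is the same for a 12-atom molecule as it would be for a 120-atom one, while the sum, and therefore the norm embedding, still grows with the number of atoms.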

Here’s the trick. We intuitively expected GNNs to extrapolate well for a simple property like MW, but this is not what Pat found, and we were able to recreate his plot (left below). Jeffery noticed that Pat had used ‘mean aggregation’ rather than ‘norm aggregation’. We tried ‘norm aggregation’ and got the good extrapolative results shown on the right! In hindsight this makes sense: mean aggregation averages over atoms and discards molecule size, while norm aggregation, as a scaled sum, preserves it, and size is exactly the signal needed to predict MW. Jeffery’s Jupyter notebook includes additional experiments showing there is no ‘unfair’ data leakage relative to the RF, LGB, and linear regression baselines.
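If you want to try both settings yourself, here is a sketch using Chemprop’s v2-style Python API (the class names below follow the Chemprop v2 documentation; Jeffery’s notebook and older Chemprop releases expose the same choice as a training option instead):

```python
from chemprop import models, nn

# Two otherwise-identical MPNN models that differ only in how the atom
# embeddings are pooled into a molecular embedding.
model_mean = models.MPNN(
    nn.BondMessagePassing(),
    nn.MeanAggregation(),   # Chemprop's default; gave the poor MW extrapolation
    nn.RegressionFFN(),
)
model_norm = models.MPNN(
    nn.BondMessagePassing(),
    nn.NormAggregation(),   # scaled sum (divides by 100 by default)
    nn.RegressionFFN(),
)
```

Everything else, including the data splits, training loop, and hyperparameters, can be held fixed, so the aggregation scheme is the only variable.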

Extrapolation for more complex endpoints. We’ve been investigating the potential of GNNs like Chemprop to extrapolate for more complex endpoints, such as small molecule and peptide ADMET properties, and have demonstrated improvements in GNN models over RF/LGB models in the extrapolative settings of larger molecules (higher MW), more polar molecules (greater tPSA), and more lipophilic molecules (higher SlogP). Check out our preprint for more details!

Jeffery Zhou (jzhou59[at]jh.edu; jeffery.zhou[at]merck.com)
Alan Cheng (alan.cheng[at]merck.com)