This is the new home of the Practical Cheminformatics blog. The 80 previous posts are still available at https://practicalcheminformatics.blogspot.com

2025

Useful RDKit Utils - A Mötley Collection of Helpful Routines

5 minute read

Published:


A few years ago, I assembled an open-source collection of Python functions and classes that I use regularly. My motivations for putting this together were primarily selfish; I wanted to quickly pip install the functions I use all the time. The result was useful_rdkit_utils, a library of cheminformatics and machine learning (ML) functions. The library is available on GitHub, and the documentation can be found on readthedocs. The GitHub repo also includes a set of Jupyter notebooks that demonstrate some of the library’s capabilities. I recently refactored the code and added new functionality, so I thought it might be worth writing a blog post to reintroduce the library. Here is a brief overview of the useful_rdkit_utils library.

The Trouble With Tautomers

13 minute read

Published:


Introduction
One factor often overlooked when applying machine learning (ML) in small-molecule drug discovery is the influence of tautomers on model predictions. Drug-like molecules, especially those containing heterocycles and conjugated pi systems, can exist in several different tautomeric forms. These forms differ in where the hydrogens sit and in the bond orders between the atoms. Consequently, the molecular representation fed to an ML model changes with the tautomer that was drawn. This remains true regardless of whether we’re using molecular fingerprints, topological descriptors, or message-passing neural networks (MPNNs).
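To make the point about representations concrete, here is a minimal sketch using RDKit. The molecule pair (2-hydroxypyridine and 2-pyridone) and the fingerprint settings are my own illustrative choices, not taken from the post. The sketch computes Morgan fingerprints for the two tautomers and compares them, then shows one commonly used option, mapping each input to a single canonical tautomer with RDKit's TautomerEnumerator before featurizing.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.MolStandardize import rdMolStandardize

# Two tautomers of the same compound (an illustrative pair, assumed for this sketch)
tautomers = {
    "2-hydroxypyridine": "Oc1ccccn1",
    "2-pyridone": "O=c1cccc[nH]1",
}
mols = {name: Chem.MolFromSmiles(smi) for name, smi in tautomers.items()}

# Morgan fingerprints encode atom environments, including attached hydrogens,
# so moving one hydrogen changes the bits that are set
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = {name: morgan.GetFingerprint(mol) for name, mol in mols.items()}
similarity = DataStructs.TanimotoSimilarity(
    fps["2-hydroxypyridine"], fps["2-pyridone"]
)
print(f"Tanimoto similarity between the two tautomers: {similarity:.2f}")

# One possible mitigation: pick a single canonical tautomer before featurizing
enumerator = rdMolStandardize.TautomerEnumerator()
canonical = {
    name: Chem.MolToSmiles(enumerator.Canonicalize(mol))
    for name, mol in mols.items()
}
print(canonical)
```

A similarity below 1.0 means a fingerprint-based model sees the two drawings as different inputs. Whether the canonicalization step maps both inputs to the same SMILES is worth checking on your own RDKit version.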

Why Don’t Machine Learning Models Extrapolate?

5 minute read

Published:


Introduction
One thing that newcomers to machine learning (ML), and many experienced practitioners, often don’t realize is that ML models don’t extrapolate. After training an ML model on compounds with µM potency, people frequently ask why none of the molecules they designed were predicted to have nM potency. If you’re new to drug discovery, 1 nM = 0.001 µM, and a lower value usually means a more potent compound. It’s important to remember that a model can only predict values within the range of the training set. If we’ve trained a model on compounds with IC50s between 5 and 100 µM, the model won’t be able to predict an IC50 of 0.1 µM. I’d like to illustrate this with a simple example. As always, all the code that accompanies this post is available on GitHub.
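The range limit can be seen in a few lines of plain Python. This is a hedged sketch of my own, not the example from the post's GitHub repository: it uses a tiny k-nearest-neighbor regressor (a hypothetical helper, knn_predict) on synthetic 1-D data. Models that predict by averaging training targets, such as nearest neighbors or tree ensembles, are bounded by the training range by construction.

```python
import random


def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


random.seed(0)

# Synthetic data (an assumption for illustration): one descriptor x in [0, 10]
# and a target that follows y = 2x plus a little noise
train_x = [random.uniform(0.0, 10.0) for _ in range(50)]
train_y = [2.0 * x + random.gauss(0.0, 0.5) for x in train_x]

y_min, y_max = min(train_y), max(train_y)

inside = knn_predict(train_x, train_y, 5.0)    # query inside the training range
outside = knn_predict(train_x, train_y, 20.0)  # the true trend would give about 40

print(f"training target range: {y_min:.1f} to {y_max:.1f}")
print(f"prediction at x = 5:  {inside:.1f}")
print(f"prediction at x = 20: {outside:.1f}")
```

The prediction at x = 20 stays at or below the largest training target, even though the underlying trend keeps rising. The same thing happens with potency: a model of this kind trained only on µM compounds has no way to output an nM value.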