As promised, I redid the co-folding of the orthosteric and allosteric ligand sets from the recent paper by Nittinger et al. with Boltz-2. While there were a few improvements, the results remain largely the same. For more information on this analysis, please see my previous post. For a higher-resolution version of the figure above, please click here. Read more
Over the past few months, we’ve seen another rise in interest in protein-ligand co-folding, especially with the recent releases of Boltz-2 and Chai-2. While celebrating scientific progress is important, it’s just as vital to distinguish facts from hype and identify areas where these techniques need further development. For newcomers to the field, co-folding, originally developed as part of the DragonFold project at Charm Therapeutics, builds on the protein structure prediction concepts pioneered by the AlphaFold team at DeepMind. While early methods like AlphaFold2 and RoseTTAFold could “only” predict protein structures, these newer approaches predict the structure of a protein together with that of its bound ligand. Co-folding methods learn the relationship between a protein and a corresponding bound ligand from a training set of structures in the PDB; the resulting model is then used to predict the structures of new complexes. While these methods show great promise, they also have limitations. In this post, I’d like to highlight three papers where the authors conducted careful, systematic studies to examine where co-folding methods succeed and where they fall short. I will conclude by discussing where co-folding methods are effective and what steps are necessary to improve them. Read more
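For readers who want to see what a co-folding run actually looks like, here’s a minimal sketch using the Boltz command-line tool. The YAML schema and the --use_msa_server flag follow my reading of the Boltz README, and the protein sequence and ligand SMILES are placeholders, not inputs from any of the papers discussed in the post.

```python
# Minimal sketch of a Boltz co-folding run (assumes `pip install boltz`).
# The YAML schema and CLI flags follow my reading of the Boltz README;
# the protein sequence and ligand SMILES below are placeholders.
import pathlib
import subprocess
import textwrap

yaml_input = textwrap.dedent("""\
    version: 1
    sequences:
      - protein:
          id: A
          sequence: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF
      - ligand:
          id: B
          smiles: CC(=O)Oc1ccccc1C(=O)O
""")
pathlib.Path("complex.yaml").write_text(yaml_input)

# --use_msa_server fetches MSAs remotely rather than requiring local databases
subprocess.run(["boltz", "predict", "complex.yaml", "--use_msa_server"], check=True)
```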
A few years ago, I assembled an open-source collection of Python functions and classes that I use regularly. My motivations for putting this together were primarily selfish; I wanted to be able to quickly pip install the functions I use all the time. The result was useful_rdkit_utils, a library of cheminformatics and machine learning (ML) functions. The library is available on GitHub, and the documentation can be found on readthedocs. The GitHub repo also includes a set of Jupyter notebooks that demonstrate some of the library’s capabilities. I recently refactored the code and added new functionality, so I thought it might be worth writing a blog post to reintroduce the library. Here is a brief overview of the useful_rdkit_utils library. Read more
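To give a flavor of the library, here’s a short sketch. The names used below (rd_shut_the_hell_up, RDKitDescriptors, mol2morgan_fp) reflect my reading of the docs and may have shifted in the recent refactor, so check the README before copying.

```python
# A quick taste of useful_rdkit_utils. The names here are from my reading
# of the docs and may have changed in the refactor; check the README.
from rdkit import Chem
import useful_rdkit_utils as uru

uru.rd_shut_the_hell_up()  # silence RDKit's chatty logging

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# calculate the RDKit 2D descriptors for a molecule
desc_calc = uru.RDKitDescriptors()
descriptors = desc_calc.calc_mol(mol)

# Morgan (circular) fingerprint, handy as input to an ML model
fp = uru.mol2morgan_fp(mol)

print(descriptors.shape)
```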
Introduction One factor often overlooked when applying machine learning (ML) in small-molecule drug discovery is the influence of tautomers on model predictions. Drug-like molecules, especially those containing heterocycles and conjugated pi systems, can exist in several different tautomeric forms. These forms differ in hydrogen positions and in the bond orders between atoms, so the molecular representation an ML model sees changes with the tautomer chosen. This is true regardless of whether we’re using molecular fingerprints, topological descriptors, or message passing neural networks (MPNNs). Read more
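As a quick illustration of the problem (a minimal sketch of my own, not the code from the post), the two tautomers of 2-pyridone are the same compound, yet their Morgan fingerprints are far from identical. RDKit’s tautomer canonicalizer, one common mitigation, maps both to a single form:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.MolStandardize import rdMolStandardize

# Two tautomers of the same compound: 2-hydroxypyridine and 2-pyridone
mols = [Chem.MolFromSmiles(s) for s in ["Oc1ccccn1", "O=c1cccc[nH]1"]]

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(m) for m in mols]

# Same compound, different representation: Tanimoto similarity is well below 1
print(DataStructs.TanimotoSimilarity(fps[0], fps[1]))

# One mitigation: standardize to a canonical tautomer before featurizing
enumerator = rdMolStandardize.TautomerEnumerator()
print([Chem.MolToSmiles(enumerator.Canonicalize(m)) for m in mols])
```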
Introduction One thing newcomers to machine learning (ML), and many experienced practitioners, often don’t realize is that ML models don’t extrapolate. After training an ML model on compounds with µM potency, people frequently ask why none of the molecules they designed were predicted to have nM potency. If you’re new to drug discovery, 1 nM = 0.001 µM, and a lower value indicates a more potent compound. It’s important to remember that a model can only predict values within the range of the training set. If we’ve trained a model on compounds with IC50s between 5 and 100 µM, the model won’t be able to predict an IC50 of 0.1 µM. I’d like to illustrate this with a simple example. As always, all the code that accompanies this post is available on GitHub. Read more
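Here’s a minimal sketch of the effect (a toy example of my own, not the notebook from the post) using a random forest from scikit-learn. The response spans exactly 5-100 µM on the training data, and no matter how far the query points stray from the training distribution, the predictions never leave that range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# toy training data: a "potency" that spans exactly 5-100 uM
X_train = rng.uniform(-1, 1, size=(500, 5))
y_train = 5 + 47.5 * (X_train[:, 0] + 1)  # linear in the first feature

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# query points far outside the training distribution
X_new = rng.uniform(-10, 10, size=(1000, 5))
preds = model.predict(X_new)

# a tree ensemble averages training targets, so predictions stay in range
print(f"training range:   {y_train.min():.1f} to {y_train.max():.1f} uM")
print(f"prediction range: {preds.min():.1f} to {preds.max():.1f} uM")
```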