Useful RDKit Utils - A Mötley Collection of Helpful Routines
A few years ago, I assembled an open-source collection of Python functions and classes that I use regularly. My motivations for putting this together were primarily selfish; I wanted to quickly pip install the functions I use all the time. The result was useful_rdkit_utils, a library of cheminformatics and machine learning (ML) functions. The library is available on GitHub, and the documentation can be found on readthedocs. The GitHub repo also includes a set of Jupyter notebooks that demonstrate some of the library’s capabilities. I recently refactored the code and added new functionality, so I thought it might be worth writing a blog post to reintroduce the library. Here is a brief overview of the useful_rdkit_utils library.
REOS - I’ve previously written about functional group filters and their importance in removing objectionable functionality from screening collections or combinatorial libraries. The REOS class in useful_rdkit_utils provides easy programmatic access to the sets of functional group filters available in the ChEMBL database. The REOS class can be instantiated with one or more functional group rule sets and can filter RDKit molecules or SMILES. The new version can also return the results as a Pandas dataframe.
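To give a feel for what a functional group filter does under the hood, here is a minimal sketch that uses plain RDKit SMARTS matching. It is not the REOS class or its API. The rule names, the SMARTS patterns, and the `first_matching_rule` helper are hypothetical examples chosen for illustration, not entries from the ChEMBL rule sets.

```python
from rdkit import Chem

# Hypothetical rules for illustration only; real rule sets contain many more patterns.
EXAMPLE_RULES = {
    "acyl_halide": "[CX3](=O)[Cl,Br,I]",
    "aldehyde": "[CX3H1](=O)[#6]",
    "peroxide": "[OX2][OX2]",
}

# Compile each SMARTS once so the patterns can be reused for every molecule.
COMPILED_RULES = {
    name: Chem.MolFromSmarts(smarts) for name, smarts in EXAMPLE_RULES.items()
}


def first_matching_rule(smiles):
    """Return the first rule name a SMILES matches, 'ok' if nothing matches,
    or 'invalid' if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return "invalid"
    for name, pattern in COMPILED_RULES.items():
        if mol.HasSubstructMatch(pattern):
            return name
    return "ok"


# Acetyl chloride, acetaldehyde, and ethanol
results = {smi: first_matching_rule(smi) for smi in ["CC(=O)Cl", "CC=O", "CCO"]}
```

Reporting the name of the first matching rule, rather than a bare pass/fail flag, makes it easy to tabulate why molecules were rejected.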
Ring Systems - The useful_rdkit_utils library has two classes that provide functionality for working with ring systems. The RingSystemFinder class identifies ring systems and makes it easy to perform analyses like those in Peter Ertl’s papers. The more helpful of the two is RingSystemLookup, which I use to filter the output of generative models. Many of these models tend to produce silly molecules that could never exist. These silly molecules often contain unstable or strained ring systems. One quick way to check whether a ring system is viable is to see if it exists in the literature. To this end, I analyzed the ChEMBL database and extracted all the ring systems and their frequencies. When presented with a molecule, the RingSystemLookup class returns a list of ring systems and their associated frequencies. In practice, I typically reject any generated molecule containing a ring system that appears fewer than five times in ChEMBL. The useful_rdkit_utils library also contains a few utility functions that make it easier to identify spiro fusions and get information on the ring sizes in a molecule.
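The filtering idea can be sketched with RDKit alone. The code below is a simplified stand-in, not the RingSystemFinder or RingSystemLookup implementation: it merges rings that share at least one atom into a ring system, then checks each system against a frequency table. The helper names and the `frequency_table` dictionary are hypothetical, and the counts in the example are made up; a real table would come from an analysis of ChEMBL.

```python
from rdkit import Chem


def ring_system_smiles(mol):
    """Return one fragment SMILES per ring system.
    Rings sharing at least one atom (fused or spiro) are merged into one system."""
    systems = []
    for ring in mol.GetRingInfo().AtomRings():
        current = set(ring)
        untouched = []
        for system in systems:
            if current & system:
                current |= system  # absorb any existing system that overlaps this ring
            else:
                untouched.append(system)
        untouched.append(current)
        systems = untouched
    return [Chem.MolFragmentToSmiles(mol, atomsToUse=sorted(s)) for s in systems]


def passes_ring_filter(mol, frequency_table, min_count=5):
    """Keep a molecule only if every ring system was seen at least min_count times.
    frequency_table is a hypothetical dict mapping ring system SMILES to a count;
    ring systems missing from the table are treated as never observed."""
    return all(
        frequency_table.get(smi, 0) >= min_count for smi in ring_system_smiles(mol)
    )


mol = Chem.MolFromSmiles("Cc1ccccc1")  # toluene
toy_table = {smi: 1000 for smi in ring_system_smiles(mol)}  # made-up counts
keep = passes_ring_filter(mol, toy_table)
```

This sketch keeps only the ring atoms themselves, so it ignores substituents such as exocyclic double bonds that a more careful ring system definition might retain.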
Fingerprints and Descriptors - The Morgan (aka ECFP) fingerprints and descriptors in the RDKit have become de facto standards for machine learning (ML) with molecules. I have written a few functions and classes that make accessing these descriptors and using them with ML models easier. All of these classes include a progress bar.
- Smi2Fp generates fingerprints from SMILES. The fingerprints can be binary or counts, and the function returns the fingerprint as either an ExplicitBitVect or a numpy array.
- RDKitDescriptors generates 1D descriptors from SMILES or RDKit molecules and can return results as a Pandas dataframe.
- Ro5Calculator generates the descriptors for Lipinski’s Rule of 5, plus TPSA, from SMILES or RDKit molecules, and can return results as a Pandas dataframe.
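For readers who want to see the underlying RDKit calls, here is a minimal sketch of the two most common tasks above: turning a SMILES into a Morgan fingerprint as a numpy array, and computing Rule of 5 descriptors plus TPSA. The function names `smiles_to_morgan` and `ro5_descriptors` are hypothetical and are not the Smi2Fp or Ro5Calculator interfaces. The sketch also assumes a reasonably recent RDKit release that provides `rdFingerprintGenerator`.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski
from rdkit.Chem import rdFingerprintGenerator, rdMolDescriptors


def smiles_to_morgan(smiles, radius=2, n_bits=2048, counts=False):
    """Return a Morgan fingerprint as a numpy array, or None for an unparsable SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    if counts:
        return generator.GetCountFingerprintAsNumPy(mol)
    return generator.GetFingerprintAsNumPy(mol)


def ro5_descriptors(smiles):
    """Return Rule of 5 descriptors plus TPSA as a dict, or None for an unparsable SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Crippen.MolLogP(mol),
        "HBD": Lipinski.NumHDonors(mol),
        "HBA": Lipinski.NumHAcceptors(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
    }


binary_fp = smiles_to_morgan("CCO")
count_fp = smiles_to_morgan("CCO", counts=True)
bits_set = int(np.count_nonzero(binary_fp))
ethanol = ro5_descriptors("CCO")
```

A list of the dictionaries returned by `ro5_descriptors` can be passed straight to `pandas.DataFrame` if a dataframe is more convenient.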
Those interested in using RDKit with ML models may also want to check out these libraries, which provide complete workflows for ML with molecules:
Scikit-Mol - paper, code
MolPipeline - paper, code
datamol - docs, code
Conformers and Molecular Geometry - I’m constantly looking up how to generate conformers with the RDKit. To make my life easier, I wrote a function to encapsulate this capability. The library also has utility functions to calculate a molecule’s center and principal moments of inertia. These functions simply alias existing functionality in the RDKit.
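As a reminder of what the underlying RDKit workflow looks like, here is a minimal sketch that embeds conformers and then computes a geometric center and the principal moments of inertia for one of them. The helper names, the default of 10 conformers, and the random seed are assumptions for illustration; the library's own wrapper functions may differ.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors, rdMolTransforms


def generate_conformers(smiles, num_confs=10, seed=42):
    """Embed up to num_confs 3D conformers; returns the molecule (with Hs) and conformer ids."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, randomSeed=seed))
    return mol, conf_ids


def center_and_pmi(mol, conf_id):
    """Return the heavy-atom centroid and the three principal moments of inertia."""
    # ComputeCentroid ignores hydrogens by default and does not weight by mass.
    center = rdMolTransforms.ComputeCentroid(mol.GetConformer(conf_id))
    pmi = (
        rdMolDescriptors.CalcPMI1(mol, confId=conf_id),
        rdMolDescriptors.CalcPMI2(mol, confId=conf_id),
        rdMolDescriptors.CalcPMI3(mol, confId=conf_id),
    )
    return (center.x, center.y, center.z), pmi


butane, butane_conf_ids = generate_conformers("CCCC", num_confs=5)
butane_center, butane_pmi = center_and_pmi(butane, butane_conf_ids[0])
```

Adding explicit hydrogens before embedding generally gives more reasonable geometries, which is why the sketch calls `Chem.AddHs` first.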
Jupyter Notebooks - Most of my work is in Jupyter Notebooks. I often forget the RDKit commands to turn on SVG for images, use CoordGen to improve structure quality, or set the structure image size. To make this easier, I wrote a set of utility functions starting with “rd_”. To access these functions, I’ll typically put the statement “import useful_rdkit_utils as uru” at the top of the notebook. Then, if I want to customize the notebook, I can type “uru.rd_”, followed by <tab>, and I’ll get a list of customization functions with intuitive names. Most of my notebooks include the function rd_setup_jupyter, which sets my preferred defaults with a single command.
Statistics - I added a few statistical functions that I frequently use.
- Bootstrap a confidence interval - When calculating a performance metric for an ML model, one should always calculate a confidence interval around that metric.
- Confidence interval for a Pearson correlation coefficient - The confidence interval around Pearson’s r depends on the r value and the number of data points. This function calculates the confidence interval.
- Maximum possible correlation - When building a model, the underlying data can be used to estimate the maximum possible correlation given experimental error. I wrote more about this here.
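The three calculations above can be sketched with the Python standard library alone. These are generic textbook versions under stated assumptions, not the library's implementations: a percentile bootstrap, the Fisher z-transform for the Pearson interval, and a simulation that adds Gaussian noise with an assumed experimental standard deviation to the measured values. All function names and default parameters here are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, metric=pearson_r, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample (x, y) pairs with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def pearson_ci(r, n, alpha=0.05):
    """Fisher z-transform confidence interval for Pearson r with n points (needs n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def max_possible_correlation(values, error_sd, cycles=1000, seed=0):
    """Mean r between the values and copies perturbed by Gaussian noise of sd error_sd."""
    rng = random.Random(seed)
    rs = []
    for _ in range(cycles):
        noisy = [v + rng.gauss(0, error_sd) for v in values]
        rs.append(pearson_r(values, noisy))
    return mean(rs)


# Synthetic data for illustration: 50 evenly spaced "measurements" between 5 and 9
measured = [5 + 4 * i / 49 for i in range(50)]
data_rng = random.Random(1)
predicted = [v + data_rng.gauss(0, 0.5) for v in measured]

boot_low, boot_high = bootstrap_ci(measured, predicted)
fisher_low, fisher_high = pearson_ci(pearson_r(measured, predicted), len(measured))
ceiling = max_possible_correlation(measured, error_sd=0.3)
```

The Fisher interval shows why the number of data points matters: its width scales with 1/sqrt(n - 3), so the same r value is far less certain with 20 points than with 200.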
Cross-Validation and Data Splitting - At the end of last year, several of us wrote a preprint outlining a set of recommendations for the statistical validation of machine learning models in drug discovery. This subset of useful_rdkit_utils provides an implementation of some methods described in that preprint.
Seaborn - The library has a few functions to set the defaults I like.
Pandas - A couple of quick utility functions to send the output of value_counts to a dataframe and to capture RDKit molecule parsing errors in a dataframe column.
Units - Two small functions that interconvert between kcal/mol and binding affinity units (uM, nM, etc.). I find these essential when working with free energy calculations.
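The conversion rests on the relationship between a dissociation constant and the binding free energy, ΔG = RT ln(Kd), with Kd expressed in molar units. Here is a minimal sketch, assuming a temperature of 298.15 K; the function names and the unit table are hypothetical and are not the library's API.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

# Multipliers that convert each concentration unit to molar
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def affinity_to_kcal(value, units="nM", temperature=298.15):
    """Convert a dissociation constant to a binding free energy in kcal/mol."""
    kd_molar = value * UNIT_TO_MOLAR[units]
    return GAS_CONSTANT_KCAL * temperature * math.log(kd_molar)


def kcal_to_affinity(delta_g, units="nM", temperature=298.15):
    """Convert a binding free energy in kcal/mol back to a dissociation constant."""
    kd_molar = math.exp(delta_g / (GAS_CONSTANT_KCAL * temperature))
    return kd_molar / UNIT_TO_MOLAR[units]


dg_1nm = affinity_to_kcal(1, "nM")
dg_10nm = affinity_to_kcal(10, "nM")
round_trip_um = kcal_to_affinity(affinity_to_kcal(250, "nM"), "uM")
```

A handy rule of thumb follows directly from the formula: at room temperature, a tenfold change in affinity corresponds to roughly 1.4 kcal/mol.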
Acknowledgments - Some functions in this library have been borrowed or adapted from other open-source collections. I would particularly like to thank iwatobipen, whose blog is a constant source of inspiration. I also owe a huge debt to Greg Landrum, Brian Kelley, Paolo Tosco, and the other RDKit developers, without whom none of this would be possible.