Sitemap
A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.
Pages
Posts
Redoing the Boltz-1 Analysis of Orthosteric and Allosteric Ligand Cofolding with Boltz-2
Published:
As promised, I redid the cofolding of the orthosteric and allosteric ligand sets from the recent paper by Nittinger et al. with Boltz-2. While there were a few improvements, the results still largely remain the same. For more information on this analysis, please see my previous post. For a higher resolution version of the figure above, please click here. Read more
Three Papers Demonstrating That Cofolding Still Has a Ways to Go
Published:
Over the past few months, we’ve seen another rise in interest in protein-ligand co-folding, especially with the recent release of Boltz-2 and Chai-2. While celebrating scientific progress is important, it’s just as vital to distinguish facts from hype and identify areas where these techniques need further development. For newcomers to the field, co-folding—originally developed as part of the DragonFold project at Charm Therapeutics—builds on the protein structure prediction concepts pioneered by the team at DeepMind working on AlphaFold. While early methods like AlphaFold2 and RoseTTAFold could “only” predict protein structures, these newer approaches can not only determine protein structures but also generate the structures of bound ligands. Co-folding methods use a training set of structures from the PDB to learn the relationship between a protein structure and a corresponding bound ligand. The model learned from the training set is then used to predict the structures of new complexes. While these methods show great promise, they also have limitations. In this post, I’d like to highlight three papers where the authors conducted careful, systematic studies to examine where co-folding methods succeed and where they fall short. I will conclude by discussing where co-folding methods are effective and what steps are necessary to improve them. Read more
GNNs can extrapolate for some properties, but there’s a trick
Published:
This guest post was written by Jeffery Zhou and Alan Cheng, and is a follow-up to “Why Don’t Machine Learning Models Extrapolate?” Read more
Useful RDKit Utils - A Mötley Collection of Helpful Routines
Published:
A few years ago, I assembled an open-source collection of Python functions and classes that I use regularly. My motivations for putting this together were primarily selfish; I wanted to quickly pip install the functions I use all the time. The result was useful_rdkit_utils, a library of cheminformatics and machine learning (ML) functions. The library is available on GitHub, and the documentation can be found on readthedocs. The GitHub repo also includes a set of Jupyter notebooks that demonstrate some of the library’s capabilities. I recently refactored the code and added new functionality, so I thought it might be worth writing a blog post to reintroduce the library. Here is a brief overview of the useful_rdkit_utils library. Read more
The Trouble With Tautomers
Published:
Introduction
One factor often overlooked when applying machine learning (ML) in small-molecule drug discovery is the influence of tautomers on model predictions. Drug-like molecules, especially those containing heterocycles and conjugated pi systems, can exist in several different tautomeric forms. These forms feature varying bond orders between the atoms. Consequently, the molecular representation used in an ML model varies. This remains true regardless of whether we’re using molecular fingerprints, topological descriptors, or message passing neural networks (MPNNs). Read more
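The point about representations can be seen directly in RDKit. Below is a minimal sketch (not taken from the post) comparing Morgan fingerprints for two tautomers of 2-pyridone; the SMILES, radius, and bit-vector size are illustrative choices, not anything prescribed by the post.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Two tautomers of the same compound: 2-hydroxypyridine and 2-pyridone.
# Same atoms, different bond orders, so the graph the model "sees" differs.
tautomer_smiles = ["Oc1ccccn1", "O=c1cccc[nH]1"]
mols = [Chem.MolFromSmiles(smi) for smi in tautomer_smiles]

# Morgan (ECFP-like) bit-vector fingerprints; radius 2 and 2048 bits are
# common defaults, used here only for illustration.
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# A Tanimoto similarity well below 1.0 means the ML input representation
# changes even though both SMILES describe the same physical compound.
print(DataStructs.TanimotoSimilarity(fps[0], fps[1]))
```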
Why Don’t Machine Learning Models Extrapolate?
Published:
Introduction
One thing newcomers to machine learning (ML) and many experienced practitioners often don’t realize is that ML doesn’t extrapolate. After training an ML model on compounds with µM potency, people frequently ask why none of the molecules they designed were predicted to have nM potency. If you’re new to drug discovery, 1 nM = 0.001 µM. A lower potency value is usually better. It’s important to remember that a model can only predict values within the range of the training set. If we’ve trained a model on compounds with IC50s between 5 and 100 µM, the model won’t be able to predict an IC50 of 0.1 µM. I’d like to illustrate this with a simple example. As always, all the code that accompanies this post is available on GitHub. Read more
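The post’s own worked example is in its GitHub repo; the snippet below is only a stand-in sketch of the underlying point, assuming a tree-based model: a random forest averages training-set targets, so its predictions cannot leave the training range. The single synthetic descriptor and the hyperparameters are arbitrary assumptions, not the post’s setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical one-descriptor "assay": the target grows linearly with x,
# but the training set only covers x in [0, 10].
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 0.5 * X_train.ravel() + rng.normal(0, 0.1, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Compounds far outside the training domain.
X_new = np.array([[20.0], [50.0]])
print("max y in training set:", round(y_train.max(), 2))
print("model predictions    :", model.predict(X_new).round(2))
# The predictions plateau near the training maximum (~5) because a random
# forest averages training targets; it cannot return values it never saw.
```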
portfolio
Portfolio item number 1
Published:
Short description of portfolio item number 1 Read more
Portfolio item number 2
Published:
Short description of portfolio item number 2 Read more
publications
Paper Title Number 1
Published in Journal 1, 2009
This paper is about the number 1. The number 2 is left for future work. Read more
Recommended citation: Your Name, You. (2009). "Paper Title Number 1." Journal 1. 1(1). http://academicpages.github.io/files/paper1.pdf
Paper Title Number 2
Published in Journal 1, 2010
This paper is about the number 2. The number 3 is left for future work. Read more
Recommended citation: Your Name, You. (2010). "Paper Title Number 2." Journal 1. 1(2). http://academicpages.github.io/files/paper2.pdf
Paper Title Number 3
Published in Journal 1, 2015
This paper is about the number 3. The number 4 is left for future work. Read more
Recommended citation: Your Name, You. (2015). "Paper Title Number 3." Journal 1. 1(3). http://academicpages.github.io/files/paper3.pdf
Paper Title Number 4
Published in GitHub Journal of Bugs, 2024
This paper is about fixing template issue #693. Read more
Recommended citation: Your Name, You. (2024). "Paper Title Number 4." GitHub Journal of Bugs. 1(3). http://academicpages.github.io/files/paper3.pdf
talks
Talk 1 on Relevant Topic in Your Field
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown! Read more
Conference Proceeding talk 3 on Relevant Topic in Your Field
Published:
This is a description of your conference proceedings talk; note the different field in type. You can put anything in this field. Read more
teaching
Teaching experience 1
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post. Read more
Teaching experience 2
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post. Read more