Three Papers Demonstrating That Cofolding Still Has a Ways to Go
Over the past few months, we’ve seen another rise in interest in protein-ligand co-folding, especially with the recent release of Boltz-2 and Chai-2. While celebrating scientific progress is important, it’s just as vital to distinguish facts from hype and identify areas where these techniques need further development. For newcomers to the field, co-folding—originally developed as part of the DragonFold project at Charm Therapeutics—builds on the protein structure prediction concepts pioneered by the team at DeepMind working on AlphaFold. While early methods like AlphaFold2 and RoseTTAFold could “only” predict protein structures, these newer approaches predict not only the protein structure but also the pose of a bound ligand. Co-folding methods use a training set of structures from the PDB to learn the relationship between a protein and its bound ligand; the model learned from this training set is then used to predict the structures of new complexes.
While these methods show great promise, they also have limitations. In this post, I’d like to highlight three papers where the authors conducted careful, systematic studies to examine where co-folding methods succeed and where they fall short. I will conclude by discussing where co-folding methods are effective and what steps are necessary to improve them.
Don’t Stray Too Far From Home
The first paper I want to highlight is “Have protein-ligand co-folding methods moved beyond memorization?” by Peter Škrinjar and colleagues from Torsten Schwede’s group at the Swiss Institute of Bioinformatics. In this paper, the authors examined the relationship between co-folding performance and similarity to the training set. They used metrics developed for the PLINDER benchmarking effort to evaluate the similarity between a predicted protein-ligand complex and the complexes used to train the co-folding model. Similarity was measured in three ways: overall sequence similarity, binding site similarity, and ligand similarity.
The main takeaway from this paper is summarized in their Figure 1 (shown below), which plots similarity to the training set on the x-axis and co-folding success on the y-axis. As mentioned earlier, similarity to the training set is assessed based on protein sequence, ligand similarity, and binding site interactions. The “success” of the co-folding methods is measured by the root mean square deviation (RMSD) between the experimental and predicted ligand coordinates, as well as the local distance difference test for protein-ligand interactions (LDDT-PLI), which evaluates how well the predicted structure captures the intermolecular interactions in the experimental structure.
As shown in the figure above, there is an almost linear decline in performance as similarity to the training set decreases. Below 80% similarity to the training set, performance falls well short of the 80-90% success rates reported in most co-folding studies. Notably, many co-folding papers use a time split of the PDB to separate training and test sets: models are typically trained on structures deposited before 2021 and tested on structures deposited in 2021 and later. This approach is problematic because crystallographers often deposit structures that are similar, or even identical, to previously deposited ones. The figure below, which I generated, shows eight structures from the widely used PoseBusters dataset that are nearly identical to structures deposited before 2021. To their credit, the authors of the PoseBusters paper pointed out that their dataset spans a range of similarities to the training set, a fact that may have been overlooked by those reporting PoseBusters as a “gold standard benchmark”. For a higher resolution version of the figure below, please click here.
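To get a rough sense of one of these similarity axes, ligand similarity, you can scan a benchmark set against the training ligands with a simple fingerprint Tanimoto comparison. The sketch below uses RDKit with placeholder SMILES lists that you would replace with the actual benchmark and pre-2021 training ligands; it is a crude stand-in for the PLINDER metrics, not a reimplementation of them.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# placeholder SMILES; substitute the pre-2021 training ligands and the
# benchmark ligands you actually want to compare
train_smiles = ["CC(=O)Oc1ccccc1C(=O)O", "CN1CCC[C@H]1c1cccnc1"]
test_smiles = ["CC(=O)Oc1ccccc1C(=O)OC"]

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
train_fps = [fpgen.GetFingerprint(Chem.MolFromSmiles(s)) for s in train_smiles]

for smi in test_smiles:
    fp = fpgen.GetFingerprint(Chem.MolFromSmiles(smi))
    # highest Tanimoto similarity to any training ligand; values near 1.0
    # flag benchmark ligands the model has effectively already seen
    best = max(DataStructs.BulkTanimotoSimilarity(fp, train_fps))
    print(f"{smi}  max similarity to training set: {best:.2f}")
```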
There are different views on Figure 1 from the Škrinjar paper. The optimistic perspective, mainly voiced by those working on co-folding methods, is that most work in drug discovery takes place on the right side of the plot, where current techniques perform well. The more skeptical view is that, while research on well-understood protein classes like kinases continues, most drug discovery efforts are aimed at novel targets, which fall on the left side of the plot. The truth likely lies somewhere in between, and a roughly 50% success rate still falls short of ideal.
Overwhelming Evidence Can Be Misleading
The second paper I want to examine is a study titled “Co-folding, the future of docking – prediction of allosteric and orthosteric ligands” by Eva Nittinger and colleagues at AstraZeneca in Sweden. In this paper, the authors compare the performance of co-folding methods on orthosteric versus allosteric binding sites. They selected 16 cases from the PDB where structures had been solved for both orthosteric and allosteric ligands targeting the same protein. For example, consider checkpoint kinase 1 (CHK-1), a serine/threonine protein kinase that has been a target for therapies exploiting the DNA damage response. The PDB includes structures such as 2e9n, with a ligand bound in the orthosteric ATP binding site, and 3f9n, with a ligand bound in an adjacent allosteric pocket. The figure below shows an overlay of these two structures, highlighting the distance between the two binding pockets.
In their paper, Nittinger and colleagues compared the effectiveness of co-folding methods in generating structures of orthosteric and allosteric ligands. The authors used NeuralPlexer, RoseTTAFold All-Atom, and Boltz-1 to generate structures of the orthosteric and allosteric ligands for each of the 16 protein targets. They then calculated the RMSD of the predicted structures against the corresponding experimental structures from the PDB. To simplify the presentation, I replicated their study using only Boltz-1 and created the figure below, which parallels Figure 1 in the Nittinger paper; I focused on Boltz-1 both to keep the figure uncluttered and because I was already familiar with the software. While the plot below shows only Boltz-1, the results reported by Nittinger demonstrate notable agreement between the different co-folding methods. Stay tuned; I’ll repeat this plot with Boltz-2 next week.
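For readers who want to reproduce this kind of experiment, a Boltz-1 prediction is driven by a small YAML file that pairs a protein sequence with a ligand SMILES. The sketch below writes such a file and invokes the boltz command line tool; the sequence and SMILES are truncated placeholders, and the exact schema and flags should be checked against the Boltz documentation for the version you have installed.

```python
import subprocess
from pathlib import Path

# minimal Boltz-1 input: one protein chain plus one ligand
# (placeholder sequence and SMILES; verify the schema against the
# Boltz documentation for your installed version)
yaml_text = """\
version: 1
sequences:
  - protein:
      id: A
      sequence: MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ  # truncated placeholder
  - ligand:
      id: B
      smiles: CC(=O)Oc1ccccc1C(=O)O
"""
Path("complex.yaml").write_text(yaml_text)

# --use_msa_server builds the required MSA remotely instead of locally
subprocess.run(["boltz", "predict", "complex.yaml", "--use_msa_server"],
               check=True)
```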
The x-axis in the plot shows the RMSD between predicted and experimental ligand structures. The RMSD is calculated by first aligning the predicted protein structure with the experimental structure from the PDB, then computing the RMSD between the corresponding ligand atoms (a minimal sketch of this calculation appears below). Each row in the figure represents a pair of protein structures, with blue points indicating the RMSD for five orthosteric poses and orange points for five allosteric poses. The labels on the y-axis correspond to the genes associated with the proteins in the PDB. As in the Nittinger paper, a vertical line is drawn at 2.5 Å to mark the limit for what is considered a “successful” pose. As shown in the figure, Boltz-1 successfully generated poses for most of the orthosteric ligands. The two cases where Boltz-1 failed to reproduce orthosteric ligand poses, CYP3A4 and RORC, are flexible proteins that are notoriously difficult to model. Conversely, performance on allosteric ligands was poor, with RMSD values exceeding 10 Å in most cases. As Nittinger points out, this is because the co-folding methods often place the allosteric ligand in the orthosteric pocket.
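For reference, here is a minimal sketch of that calculation using RDKit. It assumes the predicted complex has already been superposed onto the experimental structure by aligning the protein chains (for example, with PyMOL’s align command) and that both ligands have been exported to SDF files; the file names are placeholders. CalcRMS accounts for ligand symmetry but, unlike GetBestRMS, does not re-fit the probe onto the reference, which is what we want here.

```python
from rdkit import Chem
from rdkit.Chem import rdMolAlign

# placeholder file names; the predicted ligand comes from a complex whose
# protein has already been superposed onto the experimental protein
ref = Chem.MolFromMolFile("ligand_xtal.sdf")
probe = Chem.MolFromMolFile("ligand_predicted.sdf")

# symmetry-aware RMSD computed in place (no re-alignment of the ligand),
# so it reflects where the model actually placed the ligand
rmsd = rdMolAlign.CalcRMS(probe, ref)
print(f"ligand RMSD: {rmsd:.2f} Å")
```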
In their paper, Nittinger and coworkers achieved a modest improvement by co-folding two copies of the allosteric ligand, hoping that one would block the orthosteric pocket and force the other into the allosteric pocket. When two copies of the ligand were used, the co-folding methods managed to place half of the ligands within 2.5 Å of the center of mass of the allosteric ligands. Although this strategy appears somewhat effective, it is unsatisfying and may be difficult to implement prospectively.
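In the Boltz input format sketched earlier, this two-copy trick amounts to adding a second ligand entry with the same SMILES and a different chain id. This is my reading of the strategy, not code from the Nittinger paper.

```python
# two copies of the allosteric ligand: the hope is that one copy occupies
# the orthosteric site, forcing the other into the allosteric pocket
# (placeholder sequence and SMILES, as before)
yaml_text = """\
version: 1
sequences:
  - protein:
      id: A
      sequence: MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ  # truncated placeholder
  - ligand:
      id: B
      smiles: CC(=O)Oc1ccccc1C(=O)O
  - ligand:
      id: C
      smiles: CC(=O)Oc1ccccc1C(=O)O
"""
```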
As I mentioned in a LinkedIn post, all the orthosteric and allosteric structures used in the Nittinger benchmark were deposited in the PDB before 2021 and were likely included in the training sets for the co-folding methods. Given their presence in the training data, I was surprised that the co-folding methods performed so poorly on the allosteric systems. As Nittinger points out, most structures in the PDB for these targets are orthosteric. It’s possible that this abundance of evidence overwhelms the signal from the allosteric ligands and pulls most predictions toward the orthosteric site. This highlights one of the challenges with the current generation of protein structure prediction and co-folding methods: while we can probe these issues indirectly, it’s not easy to look inside the models and understand the decisions they make.
What Are We Learning?
The third paper I’d like to cover is “Do Deep Learning Models for Co-Folding Learn the Physics of Protein-Ligand Interactions?” by Matthew Masters and colleagues from Marcus Lill’s group at the University of Basel in Switzerland. In this paper, the authors examined the impact of binding site mutations on the poses predicted by AlphaFold3. They co-folded the same ligand SMILES with four variants of the original wild-type protein sequence and compared the RMSD of each predicted pose to the corresponding X-ray structure. The four variants were as follows (a sketch of how such binding-site mutants can be generated follows the list):
- The wild-type sequence (No Mutation), unaltered from the original PDB entry.
- A set of mutations that stripped the binding site of side chains (GLY Mutation) by replacing every residue within 3.5 Å of the ligand with GLY.
- A set of mutations that crowded the binding site (PHE Mutation) by replacing every residue within 3.5 Å of the ligand with PHE.
- A set of mutations (Reverse Mutation) that used amino acid substitution scores originally developed by Miyata to replace every residue within 3.5 Å of the ligand with a dissimilar residue; in this scheme, TYR was replaced with GLY, LEU with ASP, and so on. A complete table of the replacements is in Appendix A of the Masters paper.
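To make the procedure concrete, here is a sketch of how the GLY variant might be generated; this is my own illustration, not the authors’ code. It uses Biopython to find residues with any atom within 3.5 Å of the ligand and replaces them with glycine in the sequence that would then be co-folded. The file name, chain id, and ligand residue name are placeholders.

```python
from Bio.PDB import PDBParser, NeighborSearch
from Bio.SeqUtils import seq1

# placeholder file, chain, and ligand residue names
parser = PDBParser(QUIET=True)
model = parser.get_structure("complex", "complex.pdb")[0]
chain = model["A"]

protein_res = [r for r in chain if r.id[0] == " "]  # standard residues only
ligand_atoms = [a for r in chain if r.resname == "LIG" for a in r]

# residues with any atom within 3.5 Å of any ligand atom
ns = NeighborSearch([a for r in protein_res for a in r])
contacts = {nb.get_parent().id[1]
            for atom in ligand_atoms
            for nb in ns.search(atom.coord, 3.5)}

# GLY variant: replace every contact residue with glycine in the
# one-letter sequence passed to the co-folding model
mutant_seq = "".join("G" if r.id[1] in contacts else seq1(r.resname)
                     for r in protein_res)
print(mutant_seq)
```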
The table below displays the RMSD in Å of the predicted ligand coordinates compared to those in the original experimental structure. The table rows list three different proteins:
- 8x61 - a cryoEM structure of ATP-bound FtsE
- 7y97 - an X-ray crystal structure of CYP109B4 with a heme ligand
- 1mzm - an X-ray crystal structure of a lipid transfer protein with a flexible palmitate ligand
The second column in the table shows the RMSD for the wild-type protein, while the following columns present the RMSD for the three mutation strategies. Notably, the binding site mutations had little effect on co-folding performance: AlphaFold3 still managed to position the ligands in the original binding site despite the removal of interacting side chains or the steric crowding introduced by the bulkier PHE residues. This is not ideal and suggests a lack of generalizability. The authors propose that co-folding models may be learning longer-range patterns rather than specific protein-ligand interactions.
| Structure | No Mutation | GLY Mutation | PHE Mutation | Reverse Mutation |
|---|---|---|---|---|
| 8x61 | 1.1 | 2.0 | 1.6 | 2.3 |
| 7y97 | 0.4 | 0.7 | 0.7 | 0.9 |
| 1mzm | 0.9 | 1.0 | 1.6 | 1.7 |
It’s important to note that the preprint by Masters was a preliminary study. The authors examined only three protein-ligand systems, and all co-folding was performed using an early server-based version of AlphaFold3. Given the emergence of several other co-folding approaches, it will be essential to conduct similar tests on a wider variety of protein-ligand systems.
What Is It Good For? (with apologies to Edwin Starr)
This raises questions about where co-folding methods should be applied in drug discovery. Since they are considerably slower than docking and, as the papers above show, their predicted poses are not reliably accurate, co-folding approaches are likely not suitable for large-scale virtual screening. Although some claim that binding affinity predictions from co-folding are “as accurate as FEP and 1,000 times as fast,” the current evidence does not support this. The only reliable way to assess how well co-folding methods can predict binding affinity is probably through prospective challenges and real drug discovery projects. To this end, I commend the Boltz team for seeking collaborators who can experimentally validate their methods. I am also excited about initiatives like OpenBind and OpenADMET, which might offer the first opportunities to continually evaluate the prospective performance of machine learning models.
I have been using co-folding methods in much the same way I use molecule generation techniques: as tools to develop hypotheses that can be tested experimentally. We have many methods that tell us whether a molecule binds, such as biochemical assays, size-exclusion chromatography-MS, and DEL screens. However, apart from X-ray crystallography and cryoEM, we have very few experiments that can definitively identify where a compound binds. In most cases, we need a hypothesis to guide experiments such as mutagenesis or installing a photoaffinity label. Even more definitive experiments, like NMR or HDX-MS, are much more useful when a binding hypothesis is available. Co-folding provides an easy way to generate these hypotheses; then it’s up to our scientific knowledge, intuition, and creativity to decide how to proceed.
Although it may not seem like it, I am very excited about co-folding and other uses of machine learning in drug discovery. We have made some progress, but we still have a long way to go. It’s crucial that we keep testing these methods, understanding their limitations, and working to improve the core algorithms. These advancements will only come through close collaboration among computational method developers, drug designers, and experimentalists.