Automatic Analog Generation With Common R-group Replacements
Use data from a recently published database to generate close analogs
In a recent paper, Kosuke Takeuchi, Ryo Kunimoto, and Jürgen Bajorath report a database of R-group replacements. The authors analyzed the ChEMBL database and found the groups that were most commonly used to replace the 500 most common substituents. This seems like an interesting resource, and I wanted to put together a quick way of trying it out on a molecule of interest. As a test, I put together this notebook.
If your primarily interested in just running the notebook, just scroll down to the label Input Data Here and put in the SMILES for your molecule of interest and its core, then just "Run All".
import xmltodict
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import ReplaceCore, GetMolFrags, ReplaceSubstructs, CombineMols
from itertools import product
import numpy as np
import pandas as pd
A function to convert SMILES to canonical form. This will enable us to look up replacements for specific functional groups (see below).
def smi2cansmi(smi):
mol = Chem.MolFromSmiles(smi)
if mol:
for atm in mol.GetAtoms():
atm.SetIsotope(0)
return Chem.MolToSmiles(mol)
else:
return None
The data from Takeuchi et al is provided as an XML file. In order to use this data we'll convert it into a Python dictionary. There may be a more efficient way to do this. I have to admit that my XML skills are a bit rusty.
def generate_replacement_dictionary(replacement_xml_file):
ifs = open(replacement_xml_file)
xml = ifs.read()
d = xmltodict.parse(xml)
replacement_dict = defaultdict()
for k in d['R_replacements']['center']:
fl = k['first_layer']
if type(fl) is not list:
fl = [fl]
first_layer = [a['@SMILES'] for a in fl]
sl = [ x for x in [a.get('second_layer') for a in fl] if x]
second_layer = []
for row in sl:
if type(row) is not list:
row = [row]
for r in row:
second_layer.append(r['@SMILES'])
replacement_dict[smi2cansmi(k['@SMILES'])] = [first_layer, second_layer]
return replacement_dict
Read the XML file and convert it to a dictionary.
replacement_dict = generate_replacement_dictionary("data/top500_R_replacements.xml")
The R-group database contains two entries for each R-group, level 1 replacements, which are replacements for the R-group, and level two replacements, which are "replacements for replacements". Our dictionary contains a list where the first entry is the level 1 replacements and the second entry is the level 2 replacements.
replacement_dict["*O"]
At this point we can make a choice and either use only the level 1 replacements or use the level 1 and level 2 replacements. If we set "use_second_layer" to True in the cell below, we will use both level 1 and level 2 replacements. If we set it to False, we'll use only the level 1 replacements.
use_second_layer = True
Define a function to get replacements for a substituent. Note that we use the smi2cansmi function we created above. This function now includes a change suggested in the comments by James Wallace (thanks!).
def get_replacements(smi, replacement_dict):
cansmi = smi2cansmi(smi)
return replacement_dict.get(cansmi, [[cansmi], [cansmi]])
get_replacements("*CC",replacement_dict)
smiles_target = "c1cc(ccc1c2c(n(cn2)CC3CC3)c4ccnc(n4)N)F"
core_smarts = "n1cncc1"
mol_target = Chem.MolFromSmiles(smiles_target)
mol_target
mol_core = Chem.MolFromSmarts(core_smarts)
mol_core
As a sanity check, we will make sure the core matches the target molecule.
mol_target.HasSubstructMatch(mol_core)
Given a molecule and core, we can use the function ReplaceCore from the RDKit to get the sidechains.
sidechain_mol = ReplaceCore(mol_target,mol_core,labelByIndex=True)
sidechain_mol
The ReplaceCore function puts all of the sidechains into one molecule. We can split these up with the function GetMolFrags. Note that the fragments are labeled with the core atom to which they were attached. This will be useful when we start putting things back together. Also note that the fragment in the center doesn't appear to have a label. This is because that substituent was attached to atom 0 and the RDKit doesn't label isotopes that are 0.
sidechain_frag_list = GetMolFrags(sidechain_mol,asMols=True)
Chem.Draw.MolsToGridImage(sidechain_frag_list)
Define a function to get the attachment atom for a substituent.
def get_connect_idx(mol):
return max([x.GetIsotope() for x in mol.GetAtoms()])
attach_idx = [get_connect_idx(x) for x in sidechain_frag_list]
attach_idx
Now lets get the SMILES for each of the substituents. Note that the SMILES still have isotope labels indicating the attachment points. This could create a problem when we try to find replacements. Fortunately, the smi2cansmi function defined above strips out the isotope labels.
sidechain_smiles = [Chem.MolToSmiles(x) for x in sidechain_frag_list]
sidechain_smiles
With the R-group SMILES in hand, we can just lookup the replacements
replacement_smiles = [get_replacements(x,replacement_dict) for x in sidechain_smiles]
replacement_smiles
Now we can decide whether we want to use the layer 2 replacements.
if use_second_layer:
tmp_smiles = [a+b for a,b in replacement_smiles]
else:
tmp_smiles = [x[0] for x in replacement_smiles]
replacement_smiles = tmp_smiles
We'll add the original sidechain SMILES at the head of each of the replacement lists.
clean_sidechain_smiles = [smi2cansmi(x) for x in sidechain_smiles]
tmp_list = []
for i,j in zip(clean_sidechain_smiles,replacement_smiles):
tmp_list.append([i]+j)
replacement_smiles = tmp_list
At this point we have all of the replacement R-groups, now we want to put all combinations of these R-groups together to make new molecules. We need to define a function that will delete the dummy atom at the attachment point of each sidechain and define an attachment point. The attachment point will be defined by setting the atom map on the attachment atom to 1.
def prep_sidechain(smi):
mol = Chem.MolFromSmiles(smi)
rw_mol = Chem.RWMol(mol)
remove_idx = -1
for atm in rw_mol.GetAtoms():
if atm.GetAtomicNum() == 0:
remove_idx = atm.GetIdx()
for nbr in atm.GetNeighbors():
nbr.SetAtomMapNum(1)
rw_mol.RemoveAtom(remove_idx)
Chem.SanitizeMol(rw_mol)
return rw_mol
prep_sidechain('*c1ccc(Cl)cc1')
Now we just need a function to glue the pieces together.
def make_analogs(core,attach_pts,sidechains):
prod_list = []
for c in product(*sidechains):
sidechain_mol_list = [prep_sidechain(x) for x in c]
mol = Chem.RWMol(core)
for start_atm,m in zip(attach_pts,sidechain_mol_list):
mol = Chem.RWMol(CombineMols(mol,m))
end_atm = -1
for atm in mol.GetAtoms():
if atm.GetAtomMapNum() == 1:
end_atm = atm.GetIdx()
mol.AddBond(start_atm, end_atm, order=Chem.rdchem.BondType.SINGLE)
for atm in mol.GetAtoms():
atm.SetAtomMapNum(0)
prod_list.append(Chem.MolToSmiles(mol))
return prod_list
Generate a list of analogs for our target molecule
analog_list = make_analogs(mol_core,attach_idx,replacement_smiles)
len(analog_list)
Scroll through the replacements.
mol_list = [Chem.MolFromSmiles(x) for x in analog_list]
AllChem.Compute2DCoords(mol_core)
[AllChem.GenerateDepictionMatching2DStructure(x,mol_core) for x in mol_list]
# Display the grid
Chem.Draw.MolsToGridImage(mol_list[:25],molsPerRow=5)
There are still a few things to do here. I don't think this code will handle stereochemistry and there still may be a few edge cases. Either way, I've already found this useful. Thanks Jürgen!