Abstract
Meaningful exploration of the chemical space of druglike molecules in drug design is a highly challenging task due to a combinatorial explosion of possible modifications of molecules. In this work, we address this problem with transformer models, a type of machine learning (ML) model, with recent demonstrated success in applications to machine translation and other tasks. By training transformer models on pairs of similar bioactive molecules from the public ChEMBL dataset, we enable them to learn medicinal-chemistry-meaningful, context-dependent transformations of molecules, including those absent from the training set. Most generated molecules are highly plausible and follow similar distributions of simple properties (molecular weight, polarity, hydrogen bond donor and acceptor numbers) as the training dataset. By retrospective analysis of the performance of transformer models on ChEMBL subsets of ligands binding to COX2, DRD2, or HERG protein targets, we demonstrate that the models can generate structures identical or highly similar to highly active ligands, despite the models having not seen any ligands active against the corresponding protein target during training. Thus, our work demonstrates that transformer models, originally developed to translate texts from one natural language to another, can be easily and quickly extended to “translations” from known molecules active against a given protein target to novel molecules active against the same target, and thereby contribute to hit expansion in drug design.