Drug discovery is a multi-stage process that often begins with the identification of active molecules from a high-throughput screen or a machine learning model. Once structure-activity relationship (SAR) trends are well established, the priority shifts to identifying new analogs with improved properties. Synthesizing these compounds is the logical next step, and is central for research groups that have a synthetic chemistry team or external collaborators. Generative machine learning models have been widely adopted to propose new molecules and explore molecular space, with the goal of discovering novel compounds with desired properties. These models have been built from recurrent neural networks (RNNs), variational autoencoders (VAEs), and generative adversarial networks (GANs), and are often combined with transfer learning or with scoring of physicochemical properties to steer generative design. Although such models have proven useful for generating new molecular libraries, they are often unable to address a wide variety of problems, and they frequently converge on similar regions of molecular space when combined with a scoring function for desired properties. In addition, generated compounds are often not synthetically feasible, which limits their usefulness beyond virtual libraries in real-world scenarios. Here we introduce MegaSyn, a suite of automated tools with three components: a new hill-climb algorithm that uses SMILES-based RNN generative models, analog generation software, and retrosynthetic analysis coupled with fragment analysis to score molecules for synthetic feasibility. We describe the development and testing of this suite and, using test-case examples, propose how it might be used to optimize molecules or prioritize promising lead compounds.
Supplemental Figure 4