Abstract
Advances in deep learning have expanded the applications of virtual screening for drug-like compounds. More recently, generative models have emerged as sources of inspiration for chemists. We introduce a multi-target model, PCMol, that uses latent embeddings derived from AlphaFold to condition a de novo generative model on target proteins. Adding protein descriptors is known to extend the applicability domain and predictive capability of quantitative structure-activity relationship (QSAR) models, an approach we refer to as proteochemometrics (PCM). Similarly, using AlphaFold latent embeddings within a generative model for small molecules allows it to leverage structural relationships between proteins. This opens up new possibilities, such as interpolation within the chemical space of known highly active compounds and extrapolation to new targets based on their similarity to other proteins, which is especially relevant for understudied or novel targets. Our results indicate that PCMol can generate diverse, potentially active molecules for a wide array of proteins, including those with sparse ligand bioactivity data. We also benchmark against existing target-conditioned transformer models to illustrate the validity of using AlphaFold protein representations to steer the molecular generation process and improve generalization to unseen targets. Additionally, we demonstrate the important role of data augmentation in bolstering the performance of generative models in low-data regimes. The open-source package, along with a dataset of AlphaFold protein embeddings, is available at https://github.com/CDDLeiden/PCMol.
Supplementary materials
Title
Supplementary material
Description
This document provides additional implementation details of the multi-target de novo generative model PCMol.