Abstract
Deep neural networks can learn directly from chemical structures, without extensive, user-driven selection of descriptors, to predict molecular properties and activities with high reliability. However, these approaches typically require very large training sets to learn endpoint-specific structural features and achieve acceptable prediction accuracy. Although large datasets are becoming increasingly common in drug discovery, particularly from high-throughput screening and metabolomics, many important endpoints must still be modeled from small datasets that are very challenging to model and forecast. It is therefore highly desirable to better exploit the vast compendium of unlabeled compounds in publicly available databases to improve model performance on a user's particular series of compounds. In this study, we propose the Molecular Prediction Model Fine-Tuning (MolPMoFiT) approach, an effective transfer learning method that can be applied to any QSPR/QSAR problem. A large-scale molecular structure prediction model is first pre-trained on one million unlabeled molecules from ChEMBL in a self-supervised learning manner, and is then fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. Herein, the method is evaluated on three benchmark datasets (lipophilicity, HIV, and blood-brain barrier penetration). The results show that the method achieves prediction performance on all three datasets comparable to or better than state-of-the-art techniques reported in the literature to date.
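To make the two-stage procedure concrete, the following is a minimal PyTorch sketch of the pre-train/fine-tune idea described above. It is not the paper's implementation; all class names (`SmilesLM`, `PropertyModel`) and architectural choices here are illustrative assumptions.

```python
# Minimal sketch of self-supervised pre-training followed by task fine-tuning.
# Assumptions (not from the paper): a character-level LSTM over SMILES tokens.
import torch
import torch.nn as nn

class SmilesLM(nn.Module):
    """Character-level language model over SMILES strings (pre-training stage)."""
    def __init__(self, vocab_size: int, emb: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)   # next-token prediction

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                         # (batch, seq_len, vocab)

class PropertyModel(nn.Module):
    """Encoder transferred from the pre-trained LM plus a task-specific head."""
    def __init__(self, lm: SmilesLM, n_out: int = 1):
        super().__init__()
        self.embed, self.rnn = lm.embed, lm.rnn     # reuse pre-trained weights
        self.head = nn.Linear(lm.rnn.hidden_size, n_out)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])                  # pool last state -> endpoint

# Stage 1 (self-supervised): train SmilesLM on unlabeled ChEMBL SMILES to
# predict each next character, e.g. with nn.CrossEntropyLoss().
# Stage 2 (fine-tuning): wrap the encoder in PropertyModel and train on the
# small labeled set (lipophilicity regression; HIV / BBBP classification).
```

In this sketch the task head is simply replaced while the encoder weights are carried over; the knowledge captured from the one million unlabeled molecules is what the smaller endpoint-specific datasets then build on.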