Abstract
Predicting molecular properties remains a challenging task with numerous potential applications, notably in drug discovery. Recently, the development of deep learning, combined with growing amounts of data, has provided powerful tools for building predictive models. Since molecules can be encoded as graphs, Graph Neural Networks (GNNs) have emerged as a popular architecture choice for this task. Training GNNs to predict molecular properties, however, requires annotated data, which is costly and time-consuming to collect. In contrast, large databases of non-annotated molecules are readily available. In this setting, self-supervised learning can efficiently leverage large amounts of non-annotated data to compensate for the scarcity of annotations. In this work, we introduce a self-supervised framework for GNNs tailored specifically to molecular property prediction. Our framework uses multiple pretext tasks focusing on different scales of molecules (atoms, fragments, and entire molecules). We evaluate our method on a representative set of GNN architectures and datasets, and also consider the impact of the choice of input features. Our results show that our framework can successfully improve performance compared to training from scratch, especially in low-data regimes. The improvement varies depending on the dataset, the model architecture, and, importantly, the choice of input feature representation.