Abstract
SMILES is a line notation for entering and representing molecules. Because SMILES is inherently a linguistic construct, molecular data can be modeled in a self-supervised fashion with machine learning methods for natural language processing (NLP). The recent success of attention-based neural
networks in NLP has made large-corpora transformer pretraining a de facto standard for learning
representations and transferring knowledge to downstream tasks. In this work, we attempt to adapt
transformer capabilities to a large SMILES corpus by constructing a GPT-2-like language model. We
experimentally show that a pretrained causal transformer captures general knowledge that can be
successfully transferred to downstream tasks such as focused molecule generation and single- and multi-output molecular-property prediction. For each task, we freeze the model parameters and attach trainable lightweight networks, known as adapters, between attention blocks as an alternative to fine-tuning. With a
relatively modest setup, our transformer outperforms the recently proposed ChemBERTa transformer
and approaches state-of-the-art MoleculeNet and Chemprop results. Overall, transformers pretrained
on SMILES corpora are promising alternatives that do not require handcrafted feature engineering,
make few assumptions about the structure of the data, and scale well with the pretraining data size.
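
To illustrate the adapter approach described above, the following is a minimal sketch, assuming a PyTorch-style bottleneck adapter with a residual connection; it is not the authors' exact implementation, and `pretrained_lm`, `blocks`, and `add_adapters` are hypothetical names introduced here for illustration.

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """Lightweight bottleneck network with a residual connection."""
    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)  # project down to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(d_bottleneck, d_model)    # project back up to the model width

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual form keeps the frozen transformer's representation intact
        # while letting the adapter learn a small task-specific correction.
        return hidden + self.up(self.act(self.down(hidden)))

def add_adapters(pretrained_lm: nn.Module, blocks: nn.ModuleList, d_model: int) -> nn.ModuleList:
    # Freeze the pretrained GPT-2-like transformer; only the adapters
    # (and any task head) receive gradients during downstream training.
    for p in pretrained_lm.parameters():
        p.requires_grad_(False)
    return nn.ModuleList(Adapter(d_model) for _ in blocks)
```

In such a setup, each adapter's output would be fed into the next attention block in place of the block's original output, so only a small fraction of parameters is trained per downstream task.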
Supplementary weblinks
Python package for "Generative Pre-Training from Molecules": notebooks, scripts, and implementation details of the project.