Abstract
Recent years have witnessed a surge of interest in pre-training graph neural networks (GNNs) for molecules. Typically, atom types, given as node attributes, are randomly masked and GNNs are then trained to predict the masked types, as in AttrMask \citep{hu2020strategies}, following the Masked Language Modeling (MLM) task of BERT~\citep{devlin2019bert}. However, unlike MLM, where the vocabulary is large, AttrMask pre-training does not learn informative molecular representations because the atom `vocabulary' is small and unbalanced. To remedy this problem, we propose a variant of VQ-VAE~\citep{van2017neural} as a context-aware tokenizer that encodes atom attributes into chemically meaningful discrete codes. This enlarges the atom vocabulary and mitigates the quantitative imbalance between dominant atoms (e.g., carbon) and rare atoms (e.g., phosphorus). With the enlarged atom `vocabulary', we propose a novel node-level pre-training task, dubbed Masked Atoms Modeling (MAM), which randomly masks some of the discrete codes and pre-trains GNNs to predict them. MAM also mitigates another issue of AttrMask, namely negative transfer, and it can be easily combined with various pre-training tasks to improve their performance. Furthermore, we propose triplet masked contrastive learning (TMCL) for graph-level pre-training, which models the heterogeneous semantic similarity between molecules for effective molecule retrieval. MAM and TMCL constitute a novel pre-training framework, Mole-BERT, which can match or outperform state-of-the-art methods in a fully data-driven manner. We release the code at \textcolor{magenta}{\url{https://github.com/junxia97/Mole-BERT}}.
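To make the node-level task concrete, the PyTorch sketch below illustrates the MAM idea: a frozen tokenizer assigns each atom a discrete code from an enlarged codebook, a random subset of atoms is replaced by a mask token, and the GNN is trained to predict the codes at the masked positions. This is a minimal sketch under assumed choices (toy stand-ins \texttt{ToyTokenizer} and \texttt{ToyGNN}, a codebook of 512 entries, a 15\% masking ratio); it is not the released implementation, where the tokenizer is a context-aware VQ-VAE trained over molecular graphs.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 512   # assumed size of the enlarged atom "vocabulary"
MASK_RATIO = 0.15     # assumed masking ratio, in the spirit of BERT's MLM

class ToyTokenizer(nn.Module):
    """Stand-in for the frozen VQ-VAE tokenizer: encodes atoms and
    assigns each one the index of its nearest codebook entry."""
    def __init__(self, num_atom_types, dim, codebook_size=CODEBOOK_SIZE):
        super().__init__()
        self.encoder = nn.Embedding(num_atom_types, dim)  # real tokenizer is a context-aware GNN
        self.codebook = nn.Embedding(codebook_size, dim)

    @torch.no_grad()
    def forward(self, atom_types):                    # (num_atoms,)
        z = self.encoder(atom_types)                  # (num_atoms, dim)
        dist = torch.cdist(z, self.codebook.weight)   # distance to every code vector
        return dist.argmin(dim=-1)                    # discrete atom codes

class ToyGNN(nn.Module):
    """Stand-in encoder; a real setup would use message passing (e.g., GIN)
    over the molecular graph instead of a per-atom MLP."""
    def __init__(self, num_atom_types, dim):
        super().__init__()
        self.mask_token_id = num_atom_types           # extra [MASK] token id
        self.embed = nn.Embedding(num_atom_types + 1, dim)
        self.layers = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))

    def forward(self, atom_types):                    # (num_atoms,)
        return self.layers(self.embed(atom_types))    # node representations

def mam_loss(gnn, tokenizer, head, atom_types):
    """Mask a random subset of atoms and predict their discrete codes."""
    codes = tokenizer(atom_types)                     # targets from the tokenizer
    mask = torch.rand(atom_types.size(0)) < MASK_RATIO
    if not mask.any():
        mask[0] = True                                # ensure at least one masked atom
    corrupted = atom_types.clone()
    corrupted[mask] = gnn.mask_token_id               # replace masked atoms with [MASK]
    node_repr = gnn(corrupted)
    logits = head(node_repr[mask])                    # predict codes at masked positions only
    return F.cross_entropy(logits, codes[mask])

# Toy usage: one "molecule" with 6 atoms drawn from 10 atom types.
num_atom_types, dim = 10, 64
gnn = ToyGNN(num_atom_types, dim)
tokenizer = ToyTokenizer(num_atom_types, dim)
head = nn.Linear(dim, CODEBOOK_SIZE)
atoms = torch.randint(0, num_atom_types, (6,))
loss = mam_loss(gnn, tokenizer, head, atoms)
loss.backward()
\end{verbatim}

In this setup the cross-entropy targets come from the tokenizer's codebook rather than from the raw atom types, which is what enlarges the prediction `vocabulary' relative to AttrMask.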