Perplexity-based molecule ranking and bias estimation of chemical language models

Michael Moret; Francesca Grisoni; Paul Katzberger; Gisbert Schneider

doi:10.26434/chemrxiv-2021-zv6f1-v2

Biological and Medicinal Chemistry

28 October 2021, Version 2

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as simplified molecular input line entry system (SMILES) strings, in a rule-free manner. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score makes it possible to identify and rank the most promising molecular designs based on the probabilities learned by the CLM. Comparing “greedy” (beam search) with “explorative” (multinomial sampling) methods for SMILES generation by perplexity reveals certain advantages of multinomial sampling. Additionally, perplexity scoring is used to identify undesired model biases introduced during model training and enables the development of a new ranking system that removes those biases.
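To make the scoring idea concrete, the following is a minimal sketch of perplexity-based ranking. It assumes the standard definition of perplexity (the exponential of the negative mean log-probability per token) and is not taken from the authors' repository. The function names and the per-token probabilities are hypothetical; in practice the probabilities would come from the trained CLM.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sequence, given the probability the model
    assigned to each of its tokens (standard definition, assumed here)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)


def rank_by_perplexity(scored):
    """Sort (smiles, token_probs) pairs from lowest to highest perplexity.
    Lower perplexity means the model is more confident in the string."""
    return sorted(((s, perplexity(p)) for s, p in scored), key=lambda x: x[1])


# Hypothetical per-token probabilities, for illustration only.
candidates = [
    ("c1ccccc1", [0.5, 0.4, 0.6, 0.5, 0.5, 0.5, 0.4, 0.6]),
    ("CCO", [0.9, 0.8, 0.9]),
]
ranking = rank_by_perplexity(candidates)
for smiles, ppl in ranking:
    print(f"{smiles}\t{ppl:.3f}")
```

Because the score is normalized by sequence length, SMILES strings of different lengths can be compared on the same scale, which is what makes it usable as a ranking criterion.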

Keywords

Chemical language model

Supplementary materials

Title: Supporting information

Description: Supplementary Table and Figures.

Supplementary weblinks

Title: GitHub repository

Description: GitHub repository to reproduce the experiments.

Now Published

Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models

Michael Moret, Francesca Grisoni, Paul Katzberger, Gisbert Schneider

Journal article

Journal of Chemical Information and Modeling, Volume 62, Issue 5

Online publication date: Feb 22, 2022

Version History

Oct 28, 2021 Version 2

Oct 27, 2021 Version 1

Version Notes

Added supporting information.

Metrics

Views: 1,827

Downloads: 735

License

The content is available under CC BY 4.0

DOI

10.26434/chemrxiv-2021-zv6f1-v2

Funding

Swiss National Science Foundation

205321_182176

RETHINK initiative at ETH Zurich

Author’s competing interest statement

G.S. declares a potential financial conflict of interest as he is a consultant in the pharmaceutical industry and the co-founder of inSili.com GmbH, Zurich, Switzerland. No other potential conflicts of interest are declared.

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
