Exhaustive local chemical space exploration using a transformer model

22 May 2024, Version 2
This content is a preprint and has not undergone peer review at the time of posting.


How many near-neighbors does a molecule have? This is a simple, fundamental, but unsolved question in chemistry. Answering it is key to many important molecular optimization problems, for example lead optimization in drug discovery under the similarity principle assumption. Generative models can sample virtual molecules from a vast theoretical chemical space, but so far they have lacked explicit knowledge about molecular similarity. As a result, a generative model needs to be guided by reinforcement learning or another learning mechanism in order to sample a relevant, similar region of chemical space. Correspondingly, such a generative model provides no mechanism for quantifying how completely it can sample a particular region of chemical space. To overcome these limitations, a novel source-target molecular transformer model is proposed, regularized via a similarity kernel function. It has been trained on, to the best of our knowledge, the largest data set of molecular pairs so far, consisting of ≥ billion pairs. The regularization term enforces a direct relationship between the probability of generating a target molecule and its similarity to a given source molecule. The model is therefore able to systematically sample compounds ordered by their probability and, accordingly, by their similarity. In combination with a deterministic sampling strategy, beam search, it is possible for the first time to comprehensively explore the near-neighborhood around a specific compound. Our results show that the regularization term substantially improves the correlation between the probability of generating a target molecule and its similarity to the source molecule. The trained transformer model is able to exhaustively sample the near-neighborhood around a given drug-like molecule.
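The role of beam search described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_model`, its token probabilities, and the end-of-sequence symbol `$` are hypothetical stand-ins for a trained source-conditioned transformer. The sketch only shows how a deterministic beam search returns candidate strings ordered by negative log-likelihood (NLL), and how a sufficiently wide beam enumerates every string the toy model can generate.

```python
import math


def beam_search(next_token_probs, beam_size, max_len, eos="$"):
    """Return finished strings sorted by increasing NLL.

    next_token_probs(prefix) is assumed to return a dict mapping each
    possible next token to its conditional probability given the prefix.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative NLL)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, nll in beams:
            for token, p in next_token_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (token,), nll - math.log(p)))
        # Keep only the beam_size most probable extensions.
        candidates.sort(key=lambda c: c[1])
        beams = []
        for seq, nll in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append(("".join(seq[:-1]), nll))
            else:
                beams.append((seq, nll))
        if not beams:
            break
    return sorted(finished, key=lambda f: f[1])


def toy_model(prefix):
    """Hypothetical next-token distribution (illustrative values only)."""
    if len(prefix) == 0:
        return {"C": 0.7, "N": 0.3}
    if len(prefix) == 1:
        return {"C": 0.5, "O": 0.3, "$": 0.2}
    return {"$": 1.0}


# A narrow beam returns only the most probable strings, best first.
ranked = beam_search(toy_model, beam_size=4, max_len=3)
for string, nll in ranked:
    print(f"{string}\tNLL={nll:.3f}")

# A beam wide enough for this toy model recovers all of its probability mass.
full = beam_search(toy_model, beam_size=6, max_len=3)
print(sum(math.exp(-nll) for _, nll in full))
```

In this toy setting the beam of size 4 drops the two least probable strings, while the beam of size 6 returns all six strings, whose probabilities sum to one. When probability tracks similarity to the source molecule, as the regularization term is intended to ensure, such an NLL-ordered list is also ordered by similarity.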


Molecular optimization
Tanimoto similarity
Negative log-likelihood
Beam search

