Exhaustive local chemical space exploration using a transformer model

25 October 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


How many near-neighbors does a molecule have? This is a simple and fundamental, yet unsolved, question in chemistry. Answering it is key to many important molecular optimization problems, for example lead optimization in drug discovery. Generative models can sample virtual molecules from a vast theoretical chemical space, but so far they have lacked explicit knowledge of molecular similarity. As a result, a generative model must be guided by reinforcement learning or another learning mechanism before it can sample a relevant region of chemical space. Likewise, such models provide no mechanism for quantifying how completely they can sample a particular region of chemical space. To overcome these limitations, a novel source-target molecular transformer model is proposed. The transformer model has a regularization function based on a similarity kernel. It has been trained on what is, to the best of our knowledge, the largest data set of molecular pairs so far, consisting of ≥ 200 billion pairs. The regularization term enforces a direct relationship between the log-likelihood of generating a target molecule and its similarity to a given source molecule. The model is therefore able to systematically sample compounds ordered by their log-likelihood, and hence by their similarity. In combination with a deterministic sampling strategy, beam search, it is possible for the first time to comprehensively explore the near-neighborhood of a specific compound. Our results show that the regularization term substantially improves the correlation between the log-likelihood of generating a target compound and its similarity to the source compound. The resulting model is able to exhaustively sample the near-neighborhood of a drug-like molecule.
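The three ingredients named in the abstract (a similarity kernel, a regularization term tying negative log-likelihood to similarity, and beam search) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the preprint: fingerprints are represented as plain sets of on-bit indices, `ranking_penalty` is one plausible pairwise form of such a regularizer rather than the authors' formula, and `toy_model` is a hypothetical stand-in for the transformer's next-token distribution.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ranking_penalty(similarities: Sequence[float], nlls: Sequence[float]) -> float:
    """Hypothetical regularizer: whenever target i is more similar to the source
    than target j, penalise the amount by which nll[i] exceeds nll[j]."""
    penalty = 0.0
    for i in range(len(nlls)):
        for j in range(len(nlls)):
            if similarities[i] > similarities[j]:
                penalty += max(0.0, nlls[i] - nlls[j])
    return penalty


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int,
    max_len: int,
    end_token: str = "$",
) -> List[Tuple[str, float]]:
    """Deterministic beam search. Returns finished strings sorted by
    negative log-likelihood (most likely first)."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[str, float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, nll in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), nll - logp))
        candidates.sort(key=lambda c: c[1])
        beams = []
        for seq, nll in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append(("".join(seq[:-1]), nll))
            else:
                beams.append((seq, nll))
        if not beams:
            break
    return sorted(finished, key=lambda f: f[1])


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities over a three-symbol alphabet."""
    if len(prefix) == 0:
        return {"C": math.log(0.6), "N": math.log(0.4)}
    if len(prefix) == 1:
        return {"$": math.log(0.5), "O": math.log(0.5)}
    return {"$": 0.0}


results = beam_search(toy_model, beam_width=4, max_len=4)
```

With a beam wide enough to hold every candidate, as in this toy case, the search enumerates all strings the model can emit, and the returned probabilities sum to one. That is the sense in which a deterministic search over a similarity-ordered likelihood could be called exhaustive; a real model would need a far larger beam and a cutoff on negative log-likelihood.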


Molecular optimization
Tanimoto similarity
Negative log-likelihood
Beam search

