Beam Search Sampling for Molecular Design and Intrinsic Prioritization with Machine Intelligence

Chemical language models enable de novo drug design without the requirement for explicit molecular construction rules. While such models have been applied to generate novel compounds with desired bioactivity, the actual prioritization and selection of the most promising computational designs remains challenging. In this work, we leveraged the probabilities learnt by chemical language models with the beam search algorithm as a model-intrinsic technique for automated molecule design and scoring. Prospective application of this method yielded three novel inverse agonists of retinoic acid receptor-related orphan receptors (RORs). Each design was synthesizable in three reaction steps and presented low-micromolar to nanomolar potency towards RORg. This model-intrinsic sampling technique eliminates the strict need for external compound scoring functions, thereby further extending the applicability of generative artificial intelligence to data-driven drug discovery.


Introduction
Generative deep learning 1,2 can be applied to design pharmacologically active compounds de novo [3][4][5] . Deep learning-based molecular design algorithms can extract high-level molecular features from "raw" molecular representations [6][7][8][9] , such as molecular graphs and the Simplified Molecular Input Line Entry System (SMILES) 10 , potentially allowing them to access unexplored regions of the chemical space 11 . Previous studies demonstrated that "chemical language models" (CLMs) 12,13 that were trained on SMILES strings can generate novel molecules with experimentally validated bioactivity 9,14 . CLMs have shown the ability to learn focused chemical features from small collections of template molecules thanks to transfer learning which reuses previously learned knowledge on a new task [14][15][16] . A CLM is first pre-trained on a large amount of unlabeled data to capture the SMILES 'syntax' (i.e., how alphanumeric characters should be assembled to generate strings that correspond to valid molecules) and the physicochemical properties of the pre-training dataset. Then, the pre-trained CLM is "finetuned" with a smaller set of task-specific data 12,17,18 . During this transfer learning process the CLM is biased towards the chemical space of interest, that is, molecules with desired biological and physicochemical properties. This "few-shot" 19,20 learning ability renders CLMs particularly useful for application to biological targets, for which only few ligands are known.
Previous prospective applications of CLMs for de novo molecule generation relied on "temperature sampling" of virtual molecule libraries. Temperature sampling generates molecules by weighted random sampling from the probability distributions learnt by the CLM during training. Such generated SMILES strings might not always be chemically meaningful (invalid strings) or they do not match the feature distribution of the training data. Additional methods are usually needed to select the most promising designs from the virtual molecule libraries, e.g., based on the similarity to known bioactive molecules or activity predictions 9,12,14 . Here, we used the beam search algorithm as a model-intrinsic alternative to temperature sampling 21,22 . This method enables the automatic generation and prioritization of the designs, without strictly requiring additional compound scoring. Beam search scoring was successfully validated in a prospective application aiming to generate new retinoic acid related orphan receptor (ROR) 23 ligands from scratch.
RORs were chosen as molecular targets because these receptor proteins are an attractive but not extensively studied family of potential drug targets. They constitute a family of ligand-activated transcription factors that mainly act as monomers involved in the circadian control of energy homeostasis 24,25 and immune system regulation 26,27 , among other functions. RORs hold promising pharmacological potential for various indications, specifically for autoimmune diseases 26,27 . No ROR ligand has reached drug approval to date, partially owing to compound-related issues such as poor aqueous solubility, lack of selectivity, and clinical safety concerns 26,28,29 .

Chemical language model and beam search sampling for de novo design
We explored the beam search algorithm 30 to generate molecules from a CLM as a potential alternative to temperature sampling combined with an external ranking method. Given the probabilities learnt by a CLM, a vast number of SMILES strings could be sampled. As it is computationally not feasible to sample all outputs, a heuristic method such as beam search can be used to find the likely outputs. Here, the motivation to find the most likely SMILES strings was based on the hypothesis that the probability for generating a certain SMILES string correlates with the quality of the corresponding molecule with regard to the implicit design objective encoded in the fine-tuning set (e.g., desired bioactivity, physicochemical properties). During molecule generation by beam search sampling, the algorithm progressively adds characters ("tokens") to a SMILES string while keeping track of the k most likely SMILES string(s). When adding a new token, the algorithm computes the probability of each possible growing SMILES string to define the k most likely SMILES string(s) for the next step (Fig 1c). This process is repeated until the SMILES sequence is completed (i.e., the 'end-of-sequence' character is added) or a predefined maximal string length is reached. Thus, beam search can be used to approximately find the most probable molecules, as predicted by (i) the underlying model and (ii) the scoring function, which is computed as the product of the probabilities of each character (Fig. 1c). This scoring function ("beam search score") allows to rank the de novo designs according to the probability of their SMILES tokens.
As a framework to probe beam search sampling, we employed a recently published CLM based on a recurrent neural network with long short-term memory (LSTM) 31 cells. The CLM was trained on 365,063 molecules from ChEMBL 32 , encoded as SMILES strings (Fig. 1a), to iteratively predict the next character of each SMILES string, given the preceding tokens (Fig. 1b). This pre-trained CLM was then biased towards the design objective, i.e., the generation of new molecules with bioactivity on RORs, by transfer learning using collections of known ROR ligands (Supplementary Figure 1,  Supplementary Table 1).

Application of beam search sampling to designing inverse RORγ agonists
For prospective evaluation, we applied the beam search to the design of natural-product inspired RORγ ligands. Learning from natural products as a traditional source of inspiration for drug discovery 33,34 may hold several advantages over learning from purely synthetic molecules, because of the overall higher structural diversity, greater threedimensionality, and often superior selectivity of bioactive natural products 35,36 . We aimed to obtain de novo designs possessing three properties: (i) natural product inspired chemical structure, (ii) synthesizability, and (iii) bioactivity on RORγ. Aiming to fulfil all three objectives during transfer learning, the CLM that was previously pre-trained on bioactive molecules from ChEMBL 15 was fine-tuned on one synthetic and four natural product RORγ modulators (Suppl. Fig. 1). Beam search sampling was started after the fifth epoch of fine-tuning, to ensure that the CLM had sufficiently captured the molecular features of the small fine-tuning set.
All valid SMILES strings generated between epochs 5 and 16 (last fine-tuning epoch) were ranked by beam search scoring. The top five designs according to the beam search score (Fig. 2a) were deemed synthetically inaccessible by medicinal chemists. This was further highlighted by a machine learning algorithm for retrosynthetic analysis (IBM RXN) 37 which did not find a synthetic route for any of these molecules. Apparently, the CLM failed to meet the fundamental design criterion of synthesizability, suggesting that a different fine-tuning strategy might be required.
These findings show a first important benefit of the beam search sampling technique. Beam search sampling reveals the most likely designs from a CLM which can be assessed regarding their compliance with the design objectives to evaluate model performance and the success of fine-tuning. Aiming to improve upon these results, we performed a second experiment, in which we applied a stepwise fine-tuning strategy. First, the pre-trained model was fine-tuned for 20 epochs on 255 synthetic RORγ ligands from the US patent subset of the Protein Data Bank 38 (255 molecules, Suppl. Table 1) to capture both bioactivity and synthesizability. Then, the model was fine-tuned with four natural product RORγ modulators (Suppl. Fig.  1) for 16 epochs, aiming to bias the model towards natural-product-likeness. With this second approach, the top 5 sampled molecules ( Figure 2b) were synthetically accessible according to IBM RXN 37 , which could propose a synthetic route for each of them. Importantly, the computer-generated molecules possess certain natural product characteristics (Fig. 3, Suppl. Table 2), as indicated by a high fraction of sp 3 -hybridized carbon atoms (Fsp 3 ). The top five designs have Fsp 3 values ranging from 50% to 75%. These values are comparable to the values computed for the MEGx natural product library (Analyticon Discovery GmbH, rel. 09-01-2018), and exceed the average Fsp 3 content of the ChEMBL molecules used for pre-training (51±30% and 33±20%, respectively). These data suggested that the two-step fine-tuning procedure complied with the design objectives. The two-step approach was chosen for prospective application.
We then compared the beam search designs obtained with the chosen computational strategy to known RORγ modulators and to the fine-tuning molecules (Fig.  3). Despite being a "greedy" algorithm, the beam search sampling still allowed to explore the chemical space beyond the regions that are populated by the fine-tuning compounds ( Figure 3a). Compared to the inverse RORγ agonists annotated in ChEMBL (IC50 <1 µM, Figure 3b), the beam search designs are structurally more diverse in terms of substructure fragments encoded by Morgan fingerprints 39 . The designs possess a certain degree of similarity to the known active molecules in terms of their three-dimensional shape and partial charge distribution (WHALES descriptors 40 ). Apparently, the CLM, on top of learning the SMILES "syntax", also learned certain "semantic" ligand features that are relevant for binding to macromolecules, such as their molecular shape and partial charge patterns. 41 projection of the compounds sets based on Morgan fragment fingerprints (length = 1024, 2-bond radius, Jaccard-Tanimoto similarity). The location of the two-fine tuning sets, the RORγ modulators annotated in ChEMBL (IC50 <1 µM, 1091 compounds), and the beam search designs are shown. (b) Comparison of the sampled molecular designs with known RORγ modulators (IC50 <1 µM) in terms of Morgan fragment fingerprints ("Morgan") and threedimensional shape and electrostatics descriptors (WHALES). The pairwise distance distribution among known ChEMBL modulators is shown as a reference. "Beam (15)" and "Beam (5)" indicate the top 15 and top 5 designs, respectively. Boxplots indicate 25 th , 50 th , and 75 th percentiles (lines), mean values (circle), and outlier boundaries (whiskers, 1.5 × interquartile range).

Prospective experimental validation
Three beam search designs were synthesized and characterized in vitro. We selected them based on their beam search score. From the five most likely designs (Figure 2b), we selected molecules 1 and 2, which were ranked first and third. Compound 2 showed the highest Tanimoto similarity (Morgan fingerprints) to a known RORγ modulator (Figure 2b). The scaffolds of both compounds were also prominent among the beam search designs not included in the top 5, suggesting a structural preference. The scaffold of 1 also appeared in the design ranked 6 th . Molecules ranked 10 th and 13 th resembled compound 2. Hence, we additionally chose compound 3 of this abundant chemotype from rank 13 for prospective validation. Compounds 1-3 were synthesized according to Scheme 1.
In vitro characterization of compounds 1, 2, and 3 in Gal4-ROR hybrid reporter gene assays confirmed inverse RORγ agonism with micromolar to sub-micromolar IC50 values ( Table 1). The top-ranking compound 1 counteracted RORγ activity with an IC50 value of 4.6 µM. It was additionally active on RORα and RORβ, but precise IC50 values could not be determined due to cytotoxicity. Compounds 2 and 3 counteracted RORγ activity with IC50 values of 0.37 µM (2) and 0.68 µM (3), respectively. In addition to being inverse RORγ agonists, all three synthesized designs revealed pronounced preference for the RORγ subtype, with compounds 2 and 3 possessing more than tenfold higher potency on RORγ compared to the related RORα and RORβ isoforms. These results show that the CLM with beam search sampling conserved the bioactivity of the training molecules in the computational designs.

Methods
Data processing. Molecules were encoded as canonical SMILES strings 10 using the RDKit package (v.2018.03, www.rdkit.org) and only SMILES strings with a length of up to 140 characters were retained. SMILES strings were standardized in Python (v3.6.5, www.python.org) by removing stereochemical information, salts and duplicates.
Datasets. We used the processed data (https://github.com/ETHmodlab/virtual_libraries) 15 we recently published for both ChEMBL (used for pretraining the CLM) and for the representative natural products library, MEGx (released 01 September 2018, Analyticon Discovery GmbH). In total, the processed version of ChEMBL contains 365,063 molecules, and the processed version of MEGx 2,931 molecules. The 255 modulators of RORγ used for double fine-tuning were taken from the US patent subset of the protein data bank (rcsb.org) 38 , and the four natural products, along with the synthetic compound were taken from the study of Solt and Burris 42 . The most similar known RORγ modulators with reported IC50 were extracted from ChEMBL (organism: Homo sapiens).
Chemical language model. We used our recently published framework to implement the CLM (https://github.com/ETHmodlab/virtual_libraries) 15 , which is based on a long shortterm memory (LSTM) 31  We pretrained the CLM for 10 epochs with a learning rate of 10 -3 and performed two rounds of fine-tuning. The first round was applied for 20 epochs with a learning rate of 10 -4 by keeping layer 2 frozen. The second round was applied for 20 epochs with a learning rate of 10 -4 by keeping layer 2 and 3 frozen. The CLMs for both strategies were trained on SMILES strings encoded as one-hot vectors. We used the categorical cross-entropy loss and the Adam optimizer 43 . We applied a 10-fold data augmentation to all molecules.
Beam search ranking. We applied the following procedure for the beam search algorithm: We used a beam width (k) of 50 and defined the maximum SMILES strings length (l) as 140 tokens. We considered in the ranking molecules sampled from epoch 5 to epoch 16 of fine-tuning; the first epochs were not used to ensure the model was sufficiently biased toward the fine-tuning set.
Stochastic Neighbor Embedding. The t-SNE projection was performed with the "tsne" function of MATLAB R2020a 44 on Morgan Fingerprints ('Distance' = 'jaccard'). The perplexity value was optimized from 5 to 50 with a step equal to 10 to minimize the compression error. A perplexity equal to 5 was chosen, corresponding to the minimum compression error (0.328).

Synthetic and in vitro pharmacological procedures are described in the Supplementary Information.
Competing interest G.S. declares a potential financial conflict of interest as a founder of inSili.com GmbH, Zurich, and in his role as consultant to the pharmaceutical industry.

Supplementary Figure 2 | Principal Component Analysis based on descriptors known to differ between synthetic molecules and natural products.
In blue, molecules from ChEMBL. In orange, molecules for the natural products library MEGx. In grey, with thin black lines, de novo designs 1, 2 and 3.

Supplementary Table 2.
Comparison of the fraction of sp 3 -hybridized carbon atoms (Fsp 3 ) between a dataset with synthetic molecules (ChEMBL), a dataset of natural products (MEGx) and the de novo designs 1, 2 and 3. The mean ± standard deviation is reported.