Fragment-based Sequential Translation for Molecular Optimization

Searching for novel molecular compounds with desired properties is an important problem in drug discovery. Many existing frameworks generate molecules one atom at a time. We instead propose a flexible editing paradigm that generates molecules using learned molecular fragments, i.e., meaningful substructures of molecules. To do so, we train a variational autoencoder (VAE) to encode molecular fragments in a coherent latent space, which we then utilize as a vocabulary for editing molecules to explore the complex chemical property space. Equipped with the learned fragment vocabulary, we propose Fragment-based Sequential Translation (FaST), which learns a reinforcement learning (RL) policy to iteratively translate model-discovered molecules into increasingly novel molecules while satisfying desired properties. Empirical evaluation shows that FaST significantly improves over state-of-the-art methods on benchmark single/multi-objective molecular optimization tasks.


INTRODUCTION
Molecular optimization is a challenging task that is pivotal to drug discovery applications. Part of the challenge stems from the difficulty of exploration in the molecular space: not only are there physical constraints on molecules (molecular strings/graphs must obey specific chemical principles), but molecular property landscapes are also complex and difficult to characterize, since small changes in the molecular space can lead to large deviations in the property space.
Recent fragment-based molecular generative models have shown significant empirical advantages (Jin et al., 2019a; Podda et al., 2020; Xie et al., 2021) over atom-by-atom generative models in molecular optimization. However, they operate over a fixed set of fragments, which limits the generative capabilities of the models. Shifting away from previous frameworks, we learn a distribution of molecular fragments using vector-quantized variational autoencoders (VQ-VAE) (van den Oord et al., 2017). Our method builds molecular graphs through the addition and deletion of molecular fragments from the learned distributional fragment vocabulary, enabling the generative model to span a much larger chemical space than models with a fixed fragment vocabulary. Viewing atomic edits as primitive actions, using fragments can be thought of as using options (Sutton et al., 1999; Stolle & Precup, 2002), a temporal abstraction that simplifies the search problem.
We further introduce a novel sequential translation scheme for molecular optimization. We start the molecular search by translating from known active molecules and store the discovered molecules as new potential initialization states for subsequent searches. As a monotonic expansion of molecular graphs may end up producing undesirable, large molecules, we also include the deletion of substructures as a possible action. This enables our method to backtrack to good molecular states and iteratively improve generated molecules during the sequential translation process. Previous works optimize molecules either by generating from scratch or through a single translation from known molecules; both approaches are inefficient in finding high-quality molecules and often discover molecules lacking novelty/diversity. Our proposed framework addresses these deficiencies, since our method is (1) very efficient in finding molecules that satisfy property constraints, as the model stays close to the high-property-score chemical manifold; and (2) able to produce highly novel molecules, because the sequence of fragment-based translations can lead to very different and diverse molecules compared to the known active set.
Combining the advantages of a distributional fragment vocabulary and the sequential translation scheme, we propose Fragment-based Sequential Translation (FaST), which is realized by an RL policy that proposes fragment additions/deletions to a given molecule. Our proposed method unifies molecular optimization and translation and can generate molecules under various objectives such as property constraints, novelty constraints, and diversity constraints. The main contributions of this paper include:
1. We demonstrate a way to learn a distributional molecular fragment vocabulary through a VQ-VAE, and the effectiveness of the learned vocabulary in molecular optimization.
2. We propose a novel molecular search scheme, sequential translation, which gradually improves the quality and novelty of generation through backtracking and a stored frontier.
3. We implement a novelty/diversity-aware RL policy combining the fragment vocabulary and the sequential translation scheme that significantly outperforms state-of-the-art methods on benchmark single/multi-objective molecular optimization tasks.

RELATED WORK
Resolution of molecular optimization. Early works on molecular optimization build generative models on both SMILES/SELFIES strings (Gómez-Bombarelli et al., 2018; Kang & Cho, 2018; Nigam et al., 2021b) and molecular graphs (Simonovsky & Komodakis, 2018; Ma et al., 2018; De Cao & Kipf, 2018; Samanta et al., 2020; Mercado et al., 2021), and generate molecules character-by-character or node-by-node. Jin et al. (2018) generate graphs as junction trees by considering the vocabulary as the set of atoms or predefined rings from the data; Jin et al. (2020) use the same atom+ring vocabulary to generate molecules by augmenting extracted rationales of molecules. Generating molecules using molecular fragments is a well-established idea in traditional drug design (Erlanson, 2011), but has only recently been explored through deep learning models (Podda et al., 2020; Xie et al., 2021; Kong et al., 2021), outperforming previous atom-level models. However, these models use fragment vocabularies that are typically small and fixed a priori, limiting the chemical space spanned by the models. In our work, we utilize a learned molecular fragment vocabulary, obtained by training a VQ-VAE on a large set of fragments extracted from ChEMBL (Gaulton et al., 2012). By sampling fragments from the learned distribution, our model spans a much larger chemical space than methods using a fixed vocabulary (Figure 3b, Figure 3c).
Sequential generation of molecules. Guimaraes et al. (2017), Olivecrona et al. (2017), and You et al. (2018) frame molecular optimization as a reinforcement learning problem, but they generate at the atom/character level and from scratch each time, reducing the efficiency of the search algorithm. Jin et al. (2019b) use a graph-to-graph translation model for property optimization. However, it requires a large number of translation pairs to train, which often involves expert human knowledge and is expensive to obtain. Others have used genetic/evolutionary algorithms that perform random mutations on chemical strings to tackle this problem (Nigam et al., 2021a). Although these methods use learned discriminators to prune sub-optimal molecules, the random mutation process can become inefficient in searching for good molecules under complex property constraints. Xie et al. (2021) apply Markov Chain Monte Carlo (MCMC) sampling through editing molecules, while Kong et al. (2021) use Bayesian optimization on the latent space. In contrast, we train a novelty/diversity-aware RL policy to search for novel, diverse molecules that retain the desired properties. Our method also initializes searches from model-discovered molecules, which greatly improves the efficiency and diversity of the generated molecules. Given the enormous distributional fragment vocabulary that we use in the RL process, search strategies such as the random mutations used in genetic algorithms are likely to be very inefficient.

PRELIMINARIES
Message Passing Neural Networks (MPNN). Molecules are represented as directed graphs, where the atoms are the nodes and the bonds are the edges of the graph. More formally, let x = (V, E) denote a directed graph where v_i ∈ V are the atoms and e_ij ∈ E are the edges of the graph. The network maintains hidden states h^t_{e_ij} for each directed edge, where t is the layer index. At each step, the hidden representations aggregate information from neighboring nodes and edges, and capture a larger neighborhood of atoms. The hidden states are updated iteratively as

h^{t+1}_{e_ij} = f(h^t_{e_ij}, Σ_{v_k ∈ N(v_i) \ {v_j}} h^t_{e_ki}),

where f is parameterized by RNN cells (e.g., LSTM cells (Hochreiter & Schmidhuber, 1997) or GRU cells (Chung et al., 2014)) and N(v_i) is the set of neighbors of v_i. After T steps of message passing, the final node embeddings h_{v_i} are obtained by summing their respective incoming edge embeddings, h_{v_i} = Σ_{v_j ∈ N(v_i)} h^T_{e_ji}. The final node embeddings are then summed to get a graph embedding representation h_x = Σ_{v_i} h_{v_i}.
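The message-passing scheme above can be sketched in plain Python. This is an illustrative toy: the fixed tanh update stands in for the learned LSTM/GRU cell f, all weights are omitted, and the function name is hypothetical.

```python
import math

def mpnn_embed(nodes, edges, T=3, dim=4):
    """Toy directed-edge message passing. nodes: hashable ids;
    edges: (i, j) directed pairs (include both directions per bond)."""
    h = {e: [1.0] * dim for e in edges}          # hidden state per directed edge
    for _ in range(T):
        new_h = {}
        for (i, j) in edges:
            # sum incoming edge states at i, excluding the reverse edge (j, i)
            agg = [0.0] * dim
            for (k, i2) in edges:
                if i2 == i and k != j:
                    agg = [a + b for a, b in zip(agg, h[(k, i2)])]
            # fixed nonlinearity standing in for the learned RNN cell f
            new_h[(i, j)] = [math.tanh(0.5 * a + 0.1) for a in agg]
        h = new_h
    # node embedding: sum of incoming edge states after T steps
    node_emb = {v: [0.0] * dim for v in nodes}
    for (i, j) in edges:
        node_emb[j] = [a + b for a, b in zip(node_emb[j], h[(i, j)])]
    # graph embedding h_x: sum of node embeddings
    return [sum(node_emb[v][d] for v in nodes) for d in range(dim)]
```

The exclusion of the reverse edge (j, i) in the aggregation is what makes the directed-edge formulation avoid immediate message "echo" between a bond's two directions.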
Vector-Quantised Variational Autoencoders (VQ-VAE). To learn useful representations of fragments, we employ the VQ-VAE architecture (van den Oord et al., 2017), which maps molecular fragment graphs to a discrete latent space by using categorical distributions for the prior and posterior. The VQ-VAE defines a dictionary of k embedding elements, [s_1, s_2, ..., s_k] ∈ R^{k×l}. Given an input x (here the graph for a molecular fragment), let z_e(x) ∈ R^{d×l} be the output of the encoder (an MPNN in our case). We set l to be the same dimension for both the encoder output embeddings z_e(x) and the dictionary embeddings s_i, because the decoder input z_q(x) is computed by finding the l_2 nearest-neighbor dictionary element for each row of z_e(x):

z_q(x)_i = s_{j(i)},  where  j(i) = argmin_j ||z_e(x)_i − s_j||_2,  i = 1, ..., d.

This embedding scheme allows us to represent each molecular fragment using a length-d vector, where each entry takes a value from {1, ..., k} that corresponds to the dictionary embedding index for that row. The combinatorial vocabulary defined by the VQ-VAE has the capacity to represent k^d distinct molecular fragments, which lifts the constraints of a limited generative span under a fixed fragment vocabulary.
Since the discretization step does not allow for gradient flow, gradients are passed through the network by copying them from the dictionary embeddings to the encoder embeddings (a straight-through approximation). Additionally, there is a commitment loss that encourages the encoder to output embeddings that are similar to those in the dictionary (hence commitment). The total loss of the VAE is the following:

L = −log p(x | z_q(x)) + Σ_i ||sg[z_e(x)_i] − s_{j(i)}||_2^2 + β Σ_i ||z_e(x)_i − sg[s_{j(i)}]||_2^2,

where s_{j(i)} is the dictionary element closest to z_e(x)_i, β is a hyperparameter that controls the contribution of the commitment term, and sg denotes the stop-gradient operator.
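The quantization step and the two dictionary-related loss terms can be illustrated with a minimal pure-Python sketch. Without autodiff, the stop-gradient distinction only appears as a comment (the two terms share the same value and differ only in where the gradient flows); all names are hypothetical.

```python
def quantize(z_e, codebook):
    """Map each encoder-output row to its nearest codebook entry (l2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    indices, z_q = [], []
    for row in z_e:
        j = min(range(len(codebook)), key=lambda j: sq_dist(row, codebook[j]))
        indices.append(j)        # one of k dictionary indices per row
        z_q.append(codebook[j])
    return indices, z_q

def vq_dict_losses(z_e, z_q, beta=0.25):
    """Codebook term ||sg[z_e] - s||^2 and commitment term beta*||z_e - sg[s]||^2.
    In this sketch both evaluate to the same squared distance; in training,
    the first updates the codebook and the second updates the encoder."""
    sq = sum(sum((x - y) ** 2 for x, y in zip(a, b)) for a, b in zip(z_e, z_q))
    return sq, beta * sq
```

A fragment is then represented by `indices`, a length-d vector of codebook ids, matching the k^d combinatorial vocabulary described above.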

Molecular Optimization.
Starting with a set of good molecules I (initial set), the goal of molecular optimization is to generate a set of high-quality molecules C (constrained set) that satisfy or optimize a set of properties P. High novelty and diversity (detailed in Section 5) are also desired for de novo generation applications. We model the molecular optimization problem as a Markov decision process (MDP), defined by the 5-tuple {S, A, p, r, ρ_0}, where the state space S is the set of all possible molecular graphs. As an overview, our method introduces novel designs for the action space A and the transition model p (Section 4.1) by utilizing a distributional fragment vocabulary learned by a VQ-VAE. We define the reward and the initial state distribution, r and ρ_0 (Section 4.2), to implement the proposed sequential translation generation scheme. An illustration of our model is in Figure 1.

LEARNING DISTRIBUTIONAL FRAGMENT VOCABULARY
Figure 1: Overview of Fragment-based Sequential Translation (FaST). FaST is trained in a two-step fashion. In the first step, we train a VQ-VAE that embeds molecular fragments. In the second step, we train a search policy that uses the learned latent space as an action space. The search policy starts an episode by sampling a molecule from the frontier set F, which consists of an initial set of good molecules (I) and good molecules discovered by the policy (C). The molecule is encoded by an MPNN, which is then used to predict either an Add or Delete action. When the Add action is selected, the model predicts and samples an atom as the attachment point and subsequently predicts a fragment to attach to that atom. When the Delete action is selected, the model samples a directed edge, indicating the molecular fragment to be deleted.

Molecular Fragments are extracted from molecules in the ChEMBL database (Gaulton et al., 2012). For each molecule, we randomly sample fragments by extracting subgraphs that contain ten or fewer atoms and have a single-bond attachment to the rest of the molecule. We then use a VQ-VAE to encode these fragments into a meaningful latent space. The use of molecular fragments simplifies the search problem, while the variable-sized fragment distribution maintains the reachability of most molecular compounds. Because our search algorithm ultimately uses the latent representation of the molecules as the action space, we find that using a VQ-VAE with a categorical prior instead of the typical Gaussian prior makes RL training stable and provides good performance gains (Tang & Agrawal, 2020; Grill et al., 2020).
Encoder/Decoder We use MPNN encoders for any graph inputs, which include both fragments for the VQ-VAE, as well as molecular states during policy learning. The graph models are especially suitable for describing actions on the molecular state, as they explicitly parametrize the representations of each atom and bond. Meanwhile, the decoder architecture is a recurrent network that decodes a SELFIES representation of a molecule. We choose a recurrent network for the decoder because we do not need the full complexity of a graph decoder. Due to the construction scheme, the fragments are rooted trees, and all have a single attachment point. As our fragments are small in molecular size (≤ 10 atoms), the string grammar is simple to learn, and we find the SELFIES decoder works well empirically.
Adding and deleting fragments as actions. At each step of the MDP, the policy network first takes the current molecular graph as input and produces a Bernoulli distribution on whether to add or delete a fragment. Equipped with the fragment VQ-VAE, we define the Add and Delete actions at the fragment level:
• Fragment Addition. The addition action is characterized by (1) a probability distribution over the atoms of the molecule,

p_add(v_i | x) = σ(MLP(h_{v_1}), ..., MLP(h_{v_|V|}))_i,

where σ is the softmax operator.
(2) Conditioned on the graph embedding h_x and the attachment-point atom v_add sampled from p_add, we predict a d-channel categorical distribution p_fragment, where each row of p_fragment sums to 1. We can then sample the discrete categorical latent z_add ∈ {1, ..., k}^d from p_fragment. The fragment to add is obtained by decoding z_add through the learned, frozen fragment decoder. We then assemble the decoded fragment with the current molecular graph by attaching the fragment to the predicted attachment point v_add. Note that the attachment point on the fragment is indicated through the generated SELFIES string.
• Fragment Deletion. The deletion action acts over the directed edges of the molecule. A probability distribution over deletable edges is computed with an MLP over the edge embeddings,

p_delete(e_ij | x) = σ(MLP(h_{e_ij}))_{ij},

with the softmax taken over all deletable edges. One edge is then sampled and deleted; since the edges are directed, the directionality specifies which part of the molecule to keep and which fragment to delete.
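A minimal sketch of how the Add action might be sampled, assuming unnormalized atom scores and a d × k table of fragment-latent logits. The shapes and names are illustrative stand-ins, not the paper's implementation.

```python
import math
import random

def sample_add_action(atom_scores, frag_logits, rng=random):
    """Sample an attachment atom from a softmax over atom scores, then a
    d-row categorical fragment latent (each row picks one of k codebook ids)."""
    def softmax(xs):
        m = max(xs)
        e = [math.exp(x - m) for x in xs]
        s = sum(e)
        return [x / s for x in e]
    def sample_cat(ps):
        r, acc = rng.random(), 0.0
        for i, p in enumerate(ps):
            acc += p
            if r <= acc:
                return i
        return len(ps) - 1
    v_add = sample_cat(softmax(atom_scores))                 # attachment point
    z_add = [sample_cat(softmax(row)) for row in frag_logits]
    return v_add, z_add  # z_add would be decoded to a fragment by the VQ-VAE decoder
```

The sampled `z_add` plays the role of the discrete latent in {1, ..., k}^d described above; the frozen fragment decoder turns it into a SELFIES string.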
With the action space A defined as above, the transition model for the MDP is simply p(s' | s, a) = 1 if applying the addition/deletion action a to the molecule s results in the molecule s', and p(s' | s, a) = 0 otherwise. The fragment-based action space is powerful and suitable for policy learning because it (1) is powered by the enormous distributional vocabulary learned by the fragment VQ-VAE, and thus spans a diverse set of editing operations over molecular graphs; and (2) exploits the meaningful latent representation of fragments, since the representations of similar fragments are grouped together. These advantages greatly simplify the molecular search problem. We terminate an episode when the molecule fails to satisfy the desired properties or when the episode exceeds ten steps.
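The deterministic transition model and the termination rule can be sketched as a toy rollout loop; `policy`, `apply_action`, and `is_qualified` are stand-ins for the learned policy, the graph-editing transition, and the property/qualification checks.

```python
def run_episode(policy, apply_action, is_qualified, x0, max_steps=10):
    """One sequential-translation rollout: apply fragment edits until the
    property check fails (apply_action returns None) or max_steps pass."""
    x, found = x0, []
    for _ in range(max_steps):
        a = policy(x)
        x_next = apply_action(x, a)   # deterministic transition p(s'|s,a)
        if x_next is None:            # edit left the desired property region
            break
        x = x_next
        if is_qualified(x):           # qualified molecules go into C
            found.append(x)
    return x, found
```

In the toy test below, states are integers, the action always "grows" the state by 1, states above 5 violate the property constraint, and states of at least 3 are qualified.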

DISCOVER NOVEL MOLECULES THROUGH SEQUENTIAL TRANSLATION
We propose sequential translation, which incrementally grows the set of discovered novel molecules and uses the model-discovered molecules as starting points for further search episodes. This regime of starting exploration from states reached in previous episodes was also explored in the setting of RL from image inputs (Ecoffet et al., 2021). More concretely, we implement sequential translation with a reinforcement learning policy that operates under the fragment-based action space defined in Section 4.1, while using a moving initial state distribution ρ_0, a distribution over molecules in the frontier set F = I ∪ C. By starting new search episodes from the frontier set (the union of the initial set and good molecules discovered by the RL policy), we achieve efficient search in the chemical space by staying close to the high-quality subspace, and achieve novel molecule generation through a sequence of fragment-based editing operations to the known molecules. Our proposed search algorithm is detailed in Algorithm 1.
Algorithm 1 Molecular Optimization through Fragment-based Sequential Translation (FaST)
1: Input: N, the desired number of discovered new molecules
2: Input: I, the initial set of molecules
3: Input: D, the pretrained fragment decoder of the VQ-VAE
4: Input: C_P : S → {0, 1}, which returns 1 if the input x satisfies the desired properties, Equation (5)
5: Input: C_ND : S → {0, 1}, which returns 1 if the input x satisfies the novelty/diversity criterion, Equation (6)
6: Let C = ∅ be the discovered set of molecules
7: Let F = I ∪ C be the frontier from which search is initialized
8: Let t = 0 be the number of episodes
9: while |C| ≤ N do
10:   Let t = t + 1
11:   Update UCB(x_0, t) for all x_0 ∈ F according to Equation (8)
12:   Sample an initial molecule x_0 ∈ F from p_init(·, t), roll out an episode with the RL policy, add every explored molecule x with C_P(x) = C_ND(x) = 1 to C, and update F = I ∪ C
13: end while

Discover novel molecules and expand the frontier. Our method explores the chemical space with a property-aware and novelty/diversity-aware reinforcement learning policy that proposes addition/deletion modifications to the molecular state at every environment step to optimize the reward r. We gradually expand the discovered set C by adding qualified molecules found during RL exploration within the MDP. A molecule x is qualified if: (1) x satisfies the desired properties, as measured by property scores:

C_P(x) = ∏_{p ∈ P} 1[score_p(x) ≥ threshold_p],   (5)

where P is the set of desired properties and threshold_p is the score threshold for satisfying property p. A molecule x satisfying all desired properties has C_P(x) = 1, and C_P(x) = 0 otherwise.
(2) x is novel/diverse compared to molecules currently in the frontier F, as measured by fingerprint similarity (detailed in Section 5):

C_ND(x) = 1[max_{x' ∈ I} sim(x, x') ≤ threshold_nov] · 1[max_{x' ∈ C} sim(x, x') ≤ threshold_div],   (6)

where sim denotes fingerprint similarity, threshold_nov and threshold_div are predefined similarity thresholds for novelty and diversity, and I and C are the initial set of good molecules and the model-discovered molecules as defined in previous sections. A molecule that satisfies the novelty/diversity criterion has C_ND(x) = 1, and C_ND(x) = 0 otherwise.
We use a reward of +1 for a transition that results in a molecule qualified for the set C, and discourage the model from producing invalid molecules by adding a reward of −0.1 for a transition that produces an invalid molecular graph:

r(x, a) = +1 if C_P([x ← a]) · C_ND([x ← a]) = 1;  −0.1 if [x ← a] is an invalid molecule;  0 otherwise,   (7)

where [x ← a] denotes the molecule resulting from editing x with the fragment addition/deletion action a.
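A minimal sketch of the novelty/diversity check and the reward, with sets of integers standing in for fingerprints and a toy Tanimoto similarity; all names are hypothetical.

```python
def c_nd(x_fp, init_fps, disc_fps, sim, thr_nov=0.4, thr_div=0.4):
    """Novelty/diversity check: x must be dissimilar to every molecule in
    the initial set I and in the discovered set C."""
    novel = all(sim(x_fp, f) <= thr_nov for f in init_fps)
    diverse = all(sim(x_fp, f) <= thr_div for f in disc_fps)
    return novel and diverse

def reward(valid, satisfies_props, novel_diverse):
    """+1 for a qualified molecule, -0.1 for an invalid graph, 0 otherwise."""
    if not valid:
        return -0.1
    return 1.0 if (satisfies_props and novel_diverse) else 0.0
```

A Tanimoto similarity over feature sets is simply the intersection-over-union of the two sets, which suffices for illustration.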
Initialize search episodes from promising candidates. To bias the initial state distribution ρ_0 toward molecules that can derive more novel high-quality molecules, we keep an upper-confidence-bound (UCB) score for each initial molecule in the frontier F. We record N(x, t), the number of times we have initiated a search from a molecule x ∈ F, and R(x, t), the number of molecules qualified for adding to C found in episodes starting from x. Here t = Σ_{x ∈ F} N(x, t) is the total number of search episodes. The UCB score of an initial molecule x is calculated by:

UCB(x, t) = R(x, t) / N(x, t) + c · √(log t / N(x, t)),   (8)

where c is an exploration coefficient. The probability of a molecule in the initialization set being sampled as the starting point of a new episode is then computed by a softmax over the UCB scores: p_init(x, t + 1) = exp(UCB(x, t)) / Σ_{x' ∈ F} exp(UCB(x', t)). To summarize, FaST learns a policy that (1) chooses good initial molecules to start search episodes; (2) chooses whether to add a fragment to or delete a subgraph from a given state (a molecule in our case); and (3) chooses what to add by predicting a fragment latent embedding, or what to delete by predicting a directed edge and removing part of the molecular graph accordingly.
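The UCB-weighted initial-state distribution can be sketched as follows. The UCB1-style exploration bonus is an assumption about the exact form of Equation (8), and all names are hypothetical.

```python
import math

def ucb_scores(stats, t, c=1.0):
    """stats maps molecule -> (N(x,t), R(x,t)): episode count and number of
    qualified discoveries. Mean payoff plus a UCB1-style exploration bonus."""
    return {x: r / max(n, 1) + c * math.sqrt(math.log(max(t, 2)) / max(n, 1))
            for x, (n, r) in stats.items()}

def init_probs(stats, t):
    """Softmax over UCB scores gives the initial-state distribution rho_0."""
    s = ucb_scores(stats, t)
    m = max(s.values())                      # subtract max for stability
    e = {x: math.exp(v - m) for x, v in s.items()}
    z = sum(e.values())
    return {x: v / z for x, v in e.items()}
```

Frontier molecules that have yielded more qualified discoveries per episode receive higher sampling probability, while rarely-tried molecules keep a nonzero exploration bonus.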
Although we present our method in this section under the most realistic setting of multi-objective optimization (with experiments in Section 5), it is easily extendable to other problem settings by modifying the definitions of the constraints C_P and C_ND and the reward function r accordingly. For example, see Appendix B for the application of our method to the standard constrained penalized logP task, and Section 5 for multi-objective molecular optimization under different novelty/diversity metrics.

EXPERIMENTS
Datasets. We use benchmark datasets for molecular optimization, where the aim is to generate ligand molecules that inhibit two proteins: glycogen synthase kinase-3 beta (GSK3β) and c-Jun N-terminal kinase 3 (JNK3). The dataset, originally extracted from ExCAPE-DB (Sun et al., 2017), contains 2665 and 740 actives for GSK3β and JNK3, respectively. Each target also has 50,000 negative ligand molecules. Following previous work (Jin et al., 2020; Xie et al., 2021; Nigam et al., 2021a), we adopt the same strategy of using a random forest trained on these datasets as the oracle property predictor.
In addition to the binding properties of the generated molecules, we also consider two additional factors: the quantitative estimate of drug-likeness (QED) (Bickerton et al., 2012) and synthetic accessibility (SA) (Ertl & Schuffenhauer, 2009). QED is a quantitative score that assesses the quality of a molecule by comparing its physicochemical properties to those of approved drugs. SA is a score that accounts for the complexity of the molecule in terms of ease of synthesis, thereby providing an auxiliary metric for the feasibility of the compound as a drug candidate. Single-property optimization is often a flawed task, because the generator can overfit to the pretrained predictor and generate unsynthesizable compounds. While we report our results on both single-property and multi-property optimization tasks, we focus our analysis on the hardest multi-objective optimization task: GSK3β+JNK3+QED+SA.
Evaluation metrics. Following previous works, we evaluate our generative model on three target metrics: success, novelty, and diversity. 5,000 molecules are generated by the model, and the metric scores are computed as follows. Success rate (SR) measures the proportion of generated molecules that satisfy the desired properties. For inhibition of GSK3β and JNK3, this is a score of at least 0.5 from the pretrained predictor; QED has a target score of ≥ 0.6 and SA has a target score of ≤ 4. Novelty (Nov) measures how different the generated molecules are from the set of actives in the dataset, and is the proportion of molecules whose Morgan fingerprint Tanimoto similarity to every molecule in the active set is at most 0.4 (range [0, 1]). Diversity (Div) measures how different the generated molecules are from each other, computed as one minus the average pairwise Morgan fingerprint Tanimoto similarity across all generated compounds (range [0, 1]). PM is the product of the three metrics above (PM = SR · Nov · Div).

Implementation details. We construct the initial set of molecules for our search algorithm from the rationales extracted by Jin et al. (2020). These rationales are obtained through a sampling process, Monte Carlo Tree Search (MCTS), on the active molecules, which tries to minimize the size of the rationale subgraph while maintaining its inhibitory properties. Rationales for multi-property tasks (GSK3β+JNK3) are extracted by combining the rationales for single-property tasks. Initializing generation with subgraphs is commonly done in molecular generative models such as Shi et al. (2020) and Kong et al. (2021). We train the RL policy using the Proximal Policy Optimization (PPO, Schulman et al. 2017) algorithm. We find the RL training robust even though both the reward function r and the initial state distribution ρ_0 are non-stationary (i.e., changing during RL training). Randomly sampled translation trajectories are shown in Figure 2.
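The three metrics and their product can be computed as in the following sketch, where `sim` is a stand-in for Morgan/Tanimoto similarity, `satisfies` stands in for the property checks, and diversity is one minus the mean pairwise similarity; all names are hypothetical.

```python
def evaluate(gen_fps, active_fps, satisfies, sim):
    """SR, Nov, Div, and PM = SR * Nov * Div for a set of generated
    fingerprints compared against the active-set fingerprints."""
    n = len(gen_fps)
    sr = sum(1 for f in gen_fps if satisfies(f)) / n
    # novel: similarity at most 0.4 to every active molecule
    nov = sum(1 for f in gen_fps
              if all(sim(f, a) <= 0.4 for a in active_fps)) / n
    pair = [sim(gen_fps[i], gen_fps[j])
            for i in range(n) for j in range(i + 1, n)]
    div = 1.0 - sum(pair) / len(pair)   # one minus mean pairwise similarity
    return sr, nov, div, sr * nov * div
```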
The hyperparameters used for producing the results in Section 5 are included in Appendix D.
Baseline methods. Rationale-RL (Jin et al., 2020) extracts rationales from the active molecules and then uses RL to train a completion model that adds atoms to the rationale sequentially to generate molecules satisfying the desired properties. GA+D & JANUS (Nigam et al., 2021a) are two genetic algorithms that use random mutations of SELFIES strings to generate promising molecular candidates; JANUS leverages a two-pronged approach, accounting for mutations toward both exploration and exploitation. MARS (Xie et al., 2021) uses Markov Chain Monte Carlo (MCMC) sampling to iteratively build new molecules by adding or removing fragments, and the model is trained to fit the distribution of the active molecules. To provide a fair comparison against baselines that do not use rationales, we additionally include a baseline, MARS+Rationale, that initializes the MARS algorithm with the same initial rationale set used in Rationale-RL and our method. Where possible, we use the numbers from the corresponding original papers.
Performance. The evaluation metrics are shown in Table 1; FaST significantly outperforms all baselines on all tasks, including both single-property and multi-property optimization. On the most challenging task, GSK3β+JNK3+QED+SA, FaST improves upon the previous best model by over 30% in the product of the three evaluation metrics. Our model is able to efficiently search for molecules that stay within the constrained property space, and discovers novel and diverse molecules by sequentially translating known and discovered active molecules. The MARS+Rationale model, which uses the same rationale molecules as the initialization for its search algorithm, does not perform well compared to the original implementation, which initializes each search with a simple "C-C" molecule.

Search efficiency. While being the best-performing model, FaST is also more efficient in terms of both the number of molecules searched through the optimization process and running time. For the GSK3β+JNK3+QED+SA results reported in Table 1, FaST on average searched through 71k molecules in total to gather the 5k proposal set. As a comparison, the strongest baseline method, MARS, searched through 600k molecules to obtain its corresponding 5k proposal set (∼9× our number). Our implementation is also very efficient in terms of wall-clock time, on average taking 1 hour to finish searching for the 5k proposal set in all reported tasks on a machine with a single NVIDIA RTX 2080-Ti GPU. In comparison, the best baseline method, MARS, takes 6 hours to complete the search.
Optimize for different novelty/diversity metrics. The Morgan fingerprints used for similarity comparison contain certain inductive biases, and under different applications, different novelty/diversity metrics may be of interest. To demonstrate the viability of our model under other metrics, we train FaST using Atom Pair fingerprints (Carhart et al., 1985) on the GSK3β+JNK3+QED+SA task. The results, along with a discussion of the different fingerprint methods, are reported in Appendix C. We find that FaST can still find high-quality molecules that are novel and diverse, while the baseline methods cannot. This is unsurprising, because the baseline models are not novelty/diversity-aware during training and may suffer more when the novelty/diversity constraints are harder to satisfy.
Performance on Penalized logP. To demonstrate that our method is generally applicable to any molecular optimization task, we also include our results on the standard constrained penalized logP optimization task in Appendix B. We show that our model significantly outperforms all baselines on this task under different constraint levels. We also provide insight into the task itself: while it has been studied in many previous works, the task as currently defined is not entirely chemically meaningful, and one drawback is that a model can achieve high performance by simply generating large molecules. A detailed discussion can be found in Appendix B.

ABLATION AND ANALYSIS
Diversity of generation. In addition to the fingerprint diversity metrics presented in Section 5, another way to judge molecular diversity is to look at functional group diversity. We extract all unique molecular fragments of the 5,000 molecules generated for the GSK3β+JNK3+QED+SA task for both FaST and MARS, and produce t-SNE visualizations of these fragments in Figure 3b and Figure 3c. In total, we extracted 1.7k unique fragments from our model outputs vs. only 1.1k unique fragments for Rationale-RL and 500 unique fragments for MARS. The visualization shows that the fragments in the molecules generated by our model span a much larger chemical space. This confirms the advantage of using a learned vocabulary compared to a fixed set of fragments, as we are able to utilize a much more diverse set of chemical subgraphs.

Figure 3: (a) compares the number of steps needed to reach 5,000 molecules for the GSK3β+JNK3+QED+SA task for different ablation models. Using the VQ-VAE greatly improves the efficiency of the model (71k vs. 122k steps). (b, c) plot the t-SNE embeddings of fragments from generated compounds of our model vs. Rationale-RL and MARS. The visualization shows that our model produces a much more diverse set of fragments, which is a proxy for functional groups appearing in generated molecules.
Benefit of distributional vocabulary. To investigate the benefit of using a distributional vocabulary, we also train our model with a fixed vocabulary of fragments, consisting of the 56k unique fragments used to pretrain the VQ-VAE, instead of the pretrained VQ-VAE itself. Figure 3a compares the performance of the two models. On average, the model with fixed fragments took 122k steps to find a set of 5,000 good molecules, while the VQ-VAE model took only 71k steps (the fixed-vocabulary model required 72% more steps). We further analyze the benefit of using discrete latents with a VQ-VAE rather than continuous latents with a Gaussian-prior VAE in Appendix A.

CONCLUSION
We propose a new framework for molecular optimization that leverages a learned vocabulary of molecular fragments to search the chemical space efficiently. We demonstrate that Fragment-based Sequential Translation (FaST), which adaptively grows a set of promising molecular candidates, can generate high-quality, novel, and diverse molecules on single-property and multi-property optimization tasks. Ablation studies show that all components of our proposed method contribute to its superior performance. It is straightforward to adapt FaST to other molecular optimization tasks by modifying the fragment distribution and the desired properties. Incorporating FaST into more practical drug discovery pipelines that take synthesis paths into account is an exciting avenue for future work.

A VOCABULARY LEARNING THROUGH VQ-VAE
To evaluate the benefits of the VQ-VAE over a typical VAE trained with a Gaussian prior, we train both models and look at the distribution of fragments. Figure 4a compares the t-SNE distributions of the two models, where we sample 2,000 fragments from each model. The VAE model has tight clusters, while the VQ-VAE model exhibits a much more diverse set of fragments. We visualize random samples from the VAE and the VQ-VAE in Figure 4, where we see that the samples from the VAE are relatively simple and generic fragments, while samples from the VQ-VAE demonstrate diverse patterns. This is because the more generic fragments appear more frequently in real molecules, and a Gaussian prior over the fragment latent space favors these fragments.

B CONSTRAINED PENALIZED LOGP TASK
To demonstrate the general applicability of our model for any molecular optimization task, we also run our model on another constrained optimization task, here optimizing for penalized octanol-water partition coefficients (logP) scores of ZINC (Irwin et al., 2012) molecules. The penalized logP score is the logP score penalized by synthetic accessibility and ring size. We use the exact computation in You et al. (2018), where the components of the penalized logP score are normalized across the entire 250k ZINC training set. The generated molecules are constrained to have similar Morgan fingerprints (Rogers & Hahn, 2010) as the original molecules.
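Concretely, the penalized logP score combines three z-scored components. The sketch below follows the structure of You et al. (2018); the normalization constants are the commonly quoted approximate ZINC-250k statistics (not re-derived here), and in practice the raw logP, synthetic-accessibility, and ring-size inputs would come from a cheminformatics toolkit such as RDKit:

```python
def zscore(x, mean, std):
    """Normalize a raw value against (assumed) ZINC-250k statistics."""
    return (x - mean) / std

def penalized_logp(logp, sa_score, longest_ring):
    """Penalized logP sketch: logP minus penalties for synthetic
    accessibility and oversized rings, each z-scored. Constants below
    are approximate ZINC-250k means/stds as used in prior work."""
    ring_penalty = max(0, longest_ring - 6)  # only rings larger than 6 atoms are penalized
    return (zscore(logp, 2.457, 1.434)
            - zscore(sa_score, 3.053, 0.834)
            - zscore(ring_penalty, 0.049, 0.286))
```

Because the ring penalty is zero for any ring of size at most six, unbounded chains of small aromatic rings increase logP without incurring a penalty, which foreshadows the trivial solutions discussed below.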
Following the same setup as previous work (Jin et al., 2019b; You et al., 2018; Shi et al., 2020; Kong et al., 2021), we optimize the 800 test molecules from ZINC with the lowest penalized logP scores (the initial set I). Specifically, the task is to translate these molecules into new molecules whose fingerprint Tanimoto similarity to the originals is constrained by δ ∈ {0.4, 0.6}. This task aims at optimizing a quantity (rather than satisfying property constraints) and is a translation task (the generated molecules must stay close to the original molecules rather than being novel). To run FaST on this task, we modify its reward function, qualification criterion, and episode termination criterion. Let score(x) denote the penalized logP scoring function and sim(·, ·) the Tanimoto similarity between two molecules: • reward r = score(x_j) − score(x_i) for any transition from molecule x_i → x_j; • C, the discovered set, contains all explored molecules that satisfy Equation (5), where the threshold is given by the input parameter δ; • an episode terminates when the number of steps exceeds 10.
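The three modifications above can be sketched as a single episode loop. This is a toy illustration: `score`, `sim`, and `propose` are hypothetical stand-ins for the penalized-logP scorer, the Tanimoto similarity, and the learned fragment-editing policy, respectively:

```python
def run_episode(x0, score, sim, propose, delta=0.4, max_steps=10):
    """One FaST episode in translation mode (sketch).
    - reward: r = score(x_j) - score(x_i) per transition x_i -> x_j
    - qualification: a molecule joins the discovered set if its
      similarity to the starting molecule x0 is at least delta
    - termination: after max_steps (10) edit steps"""
    discovered, rewards, x = [], [], x0
    for _ in range(max_steps):
        x_next = propose(x)                        # one fragment edit
        rewards.append(score(x_next) - score(x))   # improvement reward
        if sim(x0, x_next) >= delta:               # similarity threshold
            discovered.append(x_next)
        x = x_next
    return discovered, rewards
```

With toy stand-ins (e.g., strings for molecules, length as the score, and character-set Jaccard as the similarity), the loop qualifies every proposal that stays within the similarity budget of the starting point.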
As shown in Figure 5, the model generates a molecule with repeating aromatic rings; though not realistic, this molecule achieves a high score while maintaining high Tanimoto similarity under Morgan fingerprints.
Results are shown in Table 2; our method greatly outperforms the other baselines, but we point out a few flaws intrinsic to the task. Because similarity is computed via Morgan fingerprints, which are hashes of substructures, repeatedly adding aromatic rings often changes the fingerprint very little. Yet each added aromatic ring linearly increases the penalized logP score, so trivial solutions can achieve high scores on this task (see Figure 5). This phenomenon has been noted in prior work, which addresses it by adding a regularizer that constrains the generated compounds to look similar to the reference molecules. Given these issues, we believe the task should be reformulated: for instance, one could use a different fingerprint whose similarity is not so easily exploited (e.g., AP (Carhart et al., 1985), MACCS (Durant et al., 2002), or ROCS (Hawkins et al., 2010)), or incorporate size constraints. Nevertheless, we provide our results for comparison with other molecular generation methods.
In general, the task of optimizing (increasing) the penalized logP score is not entirely meaningful. According to Lipinski's rule of five (Lipinski et al., 1997), a widely established set of rules for evaluating the druglikeness of molecules, the logP score should be lower than 5, so unbounded optimization of logP has little practical utility. A better task might be to optimize for all five of Lipinski's rules, which include constraints on the number of hydrogen bond donors/acceptors and on molecular mass.

C DIFFERENT NOVELTY/DIVERSITY METRICS
FaST is capable of optimizing for different novelty/diversity metrics. In this section, we compute the novelty/diversity metrics using atom-pair (AP) fingerprints (Carhart et al., 1985). While Morgan fingerprints have been successfully applied to many molecular tasks such as drug screening, they have some failure modes (Capecchi et al., 2020): in particular, Morgan fingerprints are often uninformative about the size or shape of a molecule. These properties are better captured by AP fingerprints, which account for all atom pairs, including their pairwise distances. We run the same experiment on the GSK3β+JNK3+QED+SA task described in Section 5, but replace Morgan with AP fingerprints in the novelty/diversity metrics. The results are shown in Table 3 alongside the baselines. Our method outperforms the baselines by an even greater margin, especially on the novelty metric. This is unsurprising, since our model can explicitly optimize for any similarity metric, whereas the baseline methods are not novelty/diversity-aware during training. Interestingly, we find that optimizing for AP fingerprints also yields molecules that score well under Morgan fingerprints on this task (but the converse is not true). We provide more example molecular optimization trajectories of our model on the GSK3β+JNK3+QED+SA task in Figure 6.
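Swapping the fingerprint is straightforward because both metrics ultimately reduce to a Tanimoto similarity over sets of hashed features; only the featurization (circular substructures for Morgan vs. atom pairs with topological distances for AP) changes. A minimal sketch, with fingerprints represented as sets of on-bits:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as
    sets of on-bits: |A ∩ B| / |A ∪ B|. Works identically for Morgan
    and atom-pair fingerprints; only how the bit sets are produced
    differs. Two empty fingerprints are treated as identical."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```

Because FaST's novelty reward is computed from this similarity, optimizing against a different featurization requires no change to the policy or training loop.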