A neural network model informs the total synthesis of clovane sesquiterpenoids

Efficient syntheses of complex small molecules, such as bioactive natural products, often involve detailed retrosynthetic planning and experimental evaluation of speculative synthetic routes. The central challenge of such an approach is that experimental evaluation of high-risk strategies is resource intensive because it requires iterative attempts at unsuccessful strategies. Along with the rapid development of cheminformatics and artificial intelligence, computer-aided synthetic planning has emerged to address this challenge. Herein, we report a complementary strategy that combines human-generated synthetic plans with computational prediction of the feasibility of key steps in the proposed synthesis. A neural network model (NNET) was trained on a literature-based dataset (from Reaxys) to predict the outcome of a generally disfavoured transformation, 6-endo-trig radical cyclization. The model performance was rigorously tested by experimental validation. On the basis of the virtual screening of potential substrates with our NNET model, optimal disconnections and structural modifications were chosen, resulting in five- to eight-step syntheses of three clovane sesquiterpenoids. This work establishes how a machine learning model informs human design and guides multistep syntheses of complex small molecules.

Complex molecule synthesis involves speculative retrosynthetic planning and resource-intensive experimental evaluation. Now, a complementary strategy is reported that combines human-generated synthetic plans with computational prediction to accelerate this process. A machine learning model was trained to predict the yield of radical cyclization and guide the syntheses of clovane sesquiterpenoids.

The synthesis of small molecules is integral to a variety of disciplines, from materials science to molecular devices to medicinal chemistry. For complex small molecules, efficient chemical synthesis requires detailed retrosynthetic planning 1 and experimental evaluation. These plans usually involve one or more key steps that generate remarkable structural complexity. When key steps initially fail, different iterations of the key step are attempted, which is time and resource intensive to the extent that strategies are sometimes abandoned. This process has unfortunately been necessary because nuanced changes in substrate structure often result in notable changes in chemical reactivity that are challenging to predict.
One exciting approach to address the challenges associated with synthetic design is computer-aided synthetic planning 2-8, wherein computational approaches are used to provide synthetic routes. However, creative human-designed plans are valuable and crucial, especially in the context of highly complex small molecules. Herein, we report a complementary strategy that combines creative human-generated synthetic plans with robust computational analysis to predict the feasibility of key steps in proposed syntheses. Specifically, we report the development of a neural network model (NNET) that is used to evaluate human-generated synthetic strategies towards clovane sesquiterpenoids by predicting the yields of key 6-endo-trig radical cyclization steps only on the basis of chemical structures (Fig. 1a). Efficient iterative virtual screening enabled us to choose ideal synthetic routes for multiple targets, which demonstrates the successful application of a machine learning (ML) model to guide target-oriented synthesis. Moreover, the success of this strategy argues for broader use of computational tools as part of the process for synthetic planning, and through this human-computer collaboration we highlight how human and computer planning need not be at odds.
Article https://doi.org/10.1038/s44160-023-00271-0
Clovane sesquiterpenoids exist widely in both terrestrial and marine organisms 9 and show diverse biological activities: for example, clovan-2,9-dione (1) and rumphellclovane B (3) inhibit production of superoxide anion and inhibit elastase released by human neutrophils 10,11; clovanemagnolol (6) exhibits excellent neurotrophic activity at concentrations of 10 nM (ref. 12). Intact clovanes share a bridged-ring skeleton with three quaternary centres and have been a subject of synthesis since the 1960s (refs. 13,14). Stunning biomimetic semisyntheses of clovanes have been reported 12,15, but semisynthetic approaches provide limited opportunity for deep-seated structural modifications 16. All known syntheses initially mark the B ring for preservation 14. Thus, we proposed a new high-risk but high-reward strategy that initially dissects the B ring with a late-stage 6-endo-trig radical cyclization. This de novo strategy provides flexible entry and access to diverse clovanes that complement those available from semisynthetic approaches.
Radical cyclizations constitute a powerful method for the construction of sterically hindered systems and are commonly used in complex molecule synthesis 17. However, the use of 6-endo radical cyclizations has been limited because it is difficult to predict their outcomes 18. Baldwin's and Beckwith's rules 19 and other methods of analysis can in some cases suggest trends for related systems, but cannot quantitatively inform the outcome of diverse proposed transformations. A more sophisticated prediction of synthetic feasibility of this transformation is enabled by the ML model described herein, which removes the obstacle of using the often unfavourable 6-endo-trig radical cyclization as the key disconnection for the synthesis of complex molecules.
To develop and apply an ML model to complex molecule synthesis, we devised the following workflow (Fig. 1b): (1) a library of literature examples from Reaxys was collected and annotated with chemical descriptors from simple and readily conducted density functional theory (DFT) calculations 20; (2) different ML model architectures were trained and evaluated for predictive performance; (3) human-generated retrosynthetic disconnections were evaluated using the trained ML model and (4) for the selected disconnection, substituents and functional groups were virtually screened with the model. The feasibility of using ML to enable the total synthesis of clovanes is supported by complementary research in synthetic methods development using chemoinformatics 21-24. These workflows inspired our efforts, but none of them could be directly applied to complex molecule synthesis. The major differences are summarized here: (1) the substrates used in synthetic methodology development are readily available for experimental screening and high-throughput experimentation, whereas substrates involved in complex molecule synthesis require time-consuming multistep synthetic operations to obtain; (2) similar substrates, ligands or catalysts often appear in multiple instances throughout the libraries used for synthetic methodology, which cover a relatively narrow region of chemical space, whereas the substrates and products in our radical cyclization library are highly diverse; and (3) the datasets generated from a single source (such as high-throughput experimentation) or a small number of literature references are relatively homogeneous, whereas datasets derived from highly heterogeneous sources possess potentially challenging variability 25. The success of our effort demonstrates that a synthetically useful model can be developed by carefully selecting reliable data from search engines (such as Reaxys and SciFinder) without resorting to experimentally generating new datasets.
Although a purely DFT approach was successful for substrate selection in the case of the total synthesis of paspaline A and emindole PB 26, methods that evaluate energies of multiple intermediates and transition states would be challenging for this radical cyclization if the entire pathway needed evaluation. It is generally assumed that the 5-exo mode of cyclization is kinetically favoured whereas the desired 6-endo radical cyclization intermediate is thermodynamically favoured (Fig. 2a). Therefore, a rapid calculation would be to examine the ground state energies after cyclization and optimize for the thermodynamic favourability of the 6-endo cyclization. However, it was unknown whether greater thermodynamic preference (ΔG_rxn) would result in higher yield of the 6-endo product 27. To investigate this possibility, the experimental yields of more than 100 literature reactions were plotted against their computed free energies of reaction (ΔG_rxn) in Fig. 2b. The lack of a correlation suggests that yield is determined by many factors in addition to ΔG_rxn. It was thus proposed that a multiparameter ML model would allow for accurate yield predictions of 6-endo-trig radical cyclizations, which was needed to evaluate synthetic feasibility.
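The single-descriptor check described above can be sketched as a Pearson correlation between computed reaction free energies and reported yields. This is a minimal illustration only: the function name `pearson_r` and the six data points are hypothetical placeholders, not values from the paper or its Supplementary Information.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical placeholder values, NOT data from the paper.
dg_rxn = [-22.0, -18.5, -15.0, -12.0, -9.5, -6.0]  # free energy of reaction, kcal/mol
yields = [35.0, 80.0, 10.0, 62.0, 5.0, 48.0]       # isolated yield, %

r = pearson_r(dg_rxn, yields)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
```

A value of r² close to zero on real data would correspond to the lack of correlation reported for Fig. 2b; the placeholder numbers here only show how the check is computed.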
With this hypothesis in mind, we first obtained a library of literature examples of 6-endo-trig radical cyclizations from Reaxys. Reactions were limited to C(sp³)-centred radicals undergoing intramolecular cyclization onto a pendant olefin, resulting in a set of 99 reactions, which include a fairly even distribution of yields from 0 to 90%. For each reaction in the library, radical intermediates before and after cyclization were subjected to simple and rapid DFT calculations (uB3LYP/6-31g(d)) of physical descriptors 20. A total of 340 descriptors per reaction were extracted to constitute the input parameters, including molecular, atomic and steric descriptors and linear combinations (Supplementary Table 4). Next, the library was split into training and test datasets (70/30) by Kennard-Stone sampling to guarantee that the maximal breadth of feature space is covered in the training data 24. As a large number of descriptors (340) were used relative to the small library size (99), overfitting was a major concern. Therefore, feature selection with correlation filtering (cut-off of 0.90) and dimensionality reduction with principal component analysis (threshold of 0.90) 28 were used to transform 340 descriptors into 20 parameters.
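Two of the preprocessing steps named above, Kennard-Stone sampling and correlation filtering, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code (which is available in their repository): Euclidean distance is assumed for Kennard-Stone, the filter is a simple greedy variant that keeps the first column of each correlated group, and the principal component analysis step is omitted. The function names and toy data are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab / math.sqrt(saa * sbb)

def correlation_filter(columns, cutoff=0.90):
    """Greedily keep column indices whose |r| with every kept column is <= cutoff."""
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= cutoff for k in kept):
            kept.append(idx)
    return kept

def kennard_stone(X, n_train):
    """Return (train_idx, test_idx) chosen by the Kennard-Stone algorithm."""
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(X)")
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Seed the training set with the two mutually most distant samples.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    # Repeatedly add the sample farthest from its nearest selected neighbour.
    while len(selected) < n_train:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Toy descriptor rows (hypothetical): five samples on a line.
X = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
train_idx, test_idx = kennard_stone(X, n_train=3)
print(train_idx, test_idx)

# Toy descriptor columns (hypothetical): the second duplicates the first up to scale.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
print(correlation_filter(columns))
```

Kennard-Stone picks the extremes of descriptor space first, which is why it is suited to ensuring the training set spans the library; the greedy filter shown here is only one of several ways to apply a 0.90 correlation cut-off.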
An array of supervised ML models was tuned with tenfold cross-validation on the training data and then evaluated against the test dataset to provide R² and MAE (mean absolute error) values. As shown in Fig. 2c, SIMPLS (statistically inspired modification of the partial least squares) and kNN (k-nearest neighbours) algorithms showed moderate predictive performance on the test dataset with R² values of 0.56 and 0.59, respectively. A random forest model provided better performance with R² = 0.79. A single hidden layer NNET delivered improvement over these methods, providing an R² value of 0.82, with an MAE of 12.1%. One or a few chemical descriptors did not give useful predictions, whereas the use of many more features did. This may be a function of the underlying importance of many possible factors that determine the efficiency of such radical cyclizations.
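The two reported metrics and the fold assignment for tenfold cross-validation can be sketched as below. This sketch assumes R² is computed as 1 − SS_res/SS_tot and that folds are contiguous and unshuffled; the function names and the four example yields are hypothetical and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, in the same units as the yields (percentage points)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def kfold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Hypothetical observed and predicted yields (%), for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(r2_score(observed, predicted), mean_absolute_error(observed, predicted))

# Each fold is held out once while the model is tuned on the other nine.
folds = kfold_indices(23, k=10)
print([len(f) for f in folds])
```

In practice each candidate model would be refit on nine folds and scored on the tenth, with the hyperparameters giving the best average score retained before the final evaluation on the held-out test set.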
To evaluate the soundness of our NNET model, tenfold cross-validation and leave-one-out cross-validation (LOO-CV) were conducted on the whole library, providing slightly higher mean errors of 14.2% and 14.4%, respectively; the decreased LOO-CV Q² (0.59) may be an indicator of overfitting, which prompted a need to further evaluate the use of the model to make meaningful predictions. For this reason, an experimental validation study was conducted and is described at the end of the paper. Two additional control experiments were conducted (Fig. 2d): Y-randomization, in which yields are randomly shuffled across the dataset, and a random data test, in which chemically meaningful descriptors are replaced with randomly generated values 29. The low correlations observed (R² = 0.02 and 0.01, respectively) suggest that the predictions of our NNET model were achieved by identifying relationships between yield and chemically meaningful featurization, rather than by finding chance correlations. To test the model's ability to extrapolate beyond the template library 24, literature validation was conducted with an additional 26 examples of 6-endo radical cyclization from Reaxys and SciFinder; these substrates contained special functional groups that are not represented in the training or testing datasets, such as -CF₃ substitution or heteroatoms (N, O) within the formed six-membered ring. We were pleased to find that reasonable correlation was observed, even though those key intermediates do not lie within the chemical space covered by the training data (Supplementary Fig. 13). The lower correlation (R² = 0.63) and higher MAE (15.7%) indicate the limitations of our model, but even for this alternative substrate type, the degree of correlation could be useful in some contexts. For the synthesis of clovane sesquiterpenoids, which have an all-carbon skeleton, high performance for these substrate types was not necessary. Moreover, the reasonable performance of extrapolation further suggests the model identifies chemically meaningful information from physics-based features.
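The LOO-CV and Y-randomization controls can be illustrated with a deliberately simple stand-in model. The sketch below assumes a one-descriptor least-squares line in place of the NNET, defines Q² as 1 − PRESS/SS_tot, and uses synthetic data; all names and numbers are hypothetical and serve only to show how the two controls are computed.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def q2_loo(xs, ys):
    """Leave-one-out Q^2 = 1 - PRESS / SS_tot for the stand-in linear model."""
    n = len(xs)
    press = 0.0
    for i in range(n):
        # Refit without sample i, then predict the held-out sample.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot

def y_randomized_q2(xs, ys, seed=0):
    """Repeat the LOO evaluation after shuffling the responses (Y-randomization)."""
    shuffled = list(ys)
    random.Random(seed).shuffle(shuffled)
    return q2_loo(xs, shuffled)

# Synthetic data (hypothetical): the response depends exactly on the descriptor.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 5.0 for x in xs]

q2_real = q2_loo(xs, ys)
q2_shuffled = y_randomized_q2(xs, ys)
print(q2_real, q2_shuffled)
```

A model that has learned a real descriptor-yield relationship should lose nearly all predictive power once the yields are shuffled, which is the behaviour the low R² values of the control experiments indicate for the NNET.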
With the trained NNET model, different disconnections of the B ring corresponding to different synthetic routes to clovan-2,9-dione (1) were evaluated (Fig. 3a). The predicted yields of 6-endo-trig radical cyclizations from precursors 7, 8 and 9 are 26%, 46% and 34%, respectively. Due to limited available precedent for cyclizations of this type 18,30 , conventional logic would have discouraged those disconnections and 5-exo products would have been anticipated, but the model's encouraging predictions for 8 mitigated that concern. Conventional disconnections to favour the 6-endo cyclization, such as the use of an enone instead of an alkene as the acceptor, did not provide significantly higher yield predictions (Supplementary Fig. 10). Ultimately, precursor 8 was selected, as it has a synthetically useful predicted yield and represents an innovative disconnection 14 that leads to greater synthetic accessibility 31 and more ready diversification to a variety of clovanes.
We next considered which proximal and remote functionality would be the optimal choice for the substrate given synthetic accessibility, predicted efficiency and utility in accessing a variety of clovanes. A selection of substrates (10-14), which would readily lead to other clovane natural products, from more than 100 predictions (Supplementary Fig. 11) is shown in Fig. 3a to illustrate the planning considerations that were made. For example, the introduction of an additional carbonyl group in triketone 11 has a higher predicted yield that is qualitatively in line with expert intuition. Meanwhile, other modifications at sites distal to the reaction site lead to limited variability and uniformly synthetically useful yields are predicted.
As shown in Fig. 3b, the synthetic route via radical intermediate 8 to clovan-2,9-dione (1) starts from commercially available [...] After a series of optimizations of reaction conditions (for example, temperature, concentration, solvent and reaction time), the 6-endo radical cyclization of 19 was realized with the highest yield of 45%, providing clovan-2,9-dione (1) and 5-exo product in a ratio of [...] resulted in a more efficient five-step synthesis of 1, compared to the previously disclosed 15-step racemic strategy 13. In addition, enantioenriched 17 could easily be prepared through a Corey-Bakshi-Shibata reduction (Fig. 4a) and re-oxidation sequence, leading to an eight-step asymmetric synthesis (previously completed in 17 steps) 13. As shown in Fig. 3a, the feasibility of radical cyclization of 10 was evaluated by our NNET model with a predicted yield of 51%. The experimental success of this transformation (from 23 to 24, Fig. 4a) enabled the first total syntheses of rumphellclovane A (26) 33 and canangaterpene II (2) 34 in eight steps from commercially available 15. The key elements of the synthesis are selective reduction of 17 and a late-stage Baeyer-Villiger followed by selective transesterification (24 to 26). The structure of canangaterpene II (2) was revised from the previously proposed structure on the basis of biosynthetic considerations 15, nuclear magnetic resonance spectroscopy calculations and our synthesis of the revised structure (see Supplementary Information for details).
Fig. 2 | a, [...] 311++(d,p)). b, Computed free energies of 6-endo-trig radical cyclization (ΔG_rxn, uB3LYP/6-31g(d)) do not correlate with cyclization yields. c, Performance of different ML algorithms on the test dataset for the yield predictions of 6-endo-trig radical cyclization. d, Control experiments and extrapolation for the optimal NNET model. R², coefficient of determination; SIMPLS, statistically inspired modification of the partial least squares; kNN, k-nearest neighbours; RF, random forest; CV, cross-validation and LOO-CV, leave-one-out cross-validation.
To rigorously test the model performance, we examined an additional seven radical precursors as an experimental validation set, including clovane-type precursors (8, 10-12) and also an alternative framework (27-29). As shown in Fig. 4b, experimental yields are within the calculated error of the model and showed an excellent correlation (R² = 0.89) to the ML-predicted yields with a small MAE (6.3%). Substrates 27 and 29, which had low predicted yields of the 6-endo products, gave 5-exo as the major products (see Supplementary Information for details). All reactions were repeated more than twice, and the largest range in yield (12, 47-53%) was 6%, which we take as the error associated with the experimental yields. One of the challenges associated with evaluating model performance is that reported yields in organic synthesis can be variable, yet experimental error is not evaluated or reported. We suggest researchers adopt the practice of including errors to assist later data science efforts.
Moreover, with the model reported herein, dozens of substrates can be evaluated in one day, whereas accurate DFT calculations of the full pathway for more than 100 substrates would be computationally intractable for practical time scales, as a single substrate may require weeks. The substantial time investment required to conduct DFT calculations poses an obstacle to incorporating calculations in synthetic planning; the ability to rapidly apply an ML-based model foreshadows broader future use of such computational tools in synthetic planning.
In summary, this report describes a platform that combines creative human-generated synthetic plans with robust computational analysis for a challenging key step. ML models are trained from readily accessible literature examples (from Reaxys and SciFinder) to predict the yields of a generally disfavoured chemical transformation (6-endo-trig radical cyclization). An NNET model was used to guide the retrosynthetic analysis of several sesquiterpenoid natural products, resulting in their highly efficient syntheses. We expect that models for other transformations could be developed following this workflow,

Data availability
The data supporting the findings of this study are available within the paper and its Supplementary Information.

Code availability
All code used to support the findings of this work is supplied as Supplementary Information. The code is also available on GitHub (https://github.com/Newhouse-Group/6-Endo-Radical-Cyclization). Source data are provided with this paper.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s44160-023-00271-0.