Learning relationships between chemical and physical stability for drug development

Chemical and physical stabilities are two key features considered in pharmaceutical development. Chemical stability is typically reported as a combination of potency and degradation product. For peptide products, it is common to measure physical stability via aggregation or fibrillation using the fluorescent reporter Thioflavin T. Executing stability studies is a lengthy process and requires extensive resources. To reduce the resources and shorten the process for stability studies during the development of a product, we introduce a machine learning based model for predicting the chemical stability over time using both the formulation conditions as well as the aggregation curve. In this work, we explore the relationships between the formulation, stability time point, and the measurements of chemical stability and achieve a coefficient of determination on a random test set of 0.945 and a mean absolute error (MAE) of 0.421 when using a multilayer perceptron (MLP) neural network for total degradation. Similarly, we achieve a coefficient of determination of 0.908 and an MAE of 1.435 when predicting the potency using a random forest model. When measurements of physical stability are included into the model, the MAE in the prediction of TD decreases to 0.148 for the MLP model. Using a similar strategy for the prediction of potency, the MAE decreases to 0.705 for the random forest model. Therefore, we can conclude two important points: first, chemical stability can be modeled using machine learning techniques and second there is a relationship between the physical stability of a peptide and its chemical stability.


Introduction
Peptides are an important class of biomolecules that have been rapidly developed as therapeutical agents that cover areas such as metabolic disorders, skin infections, and oncology. 1,2 Among many challenges in developing acceptable commercial peptide drug product is to demonstrate acceptable chemical and physical stabilities. This is accomplished through systematic stability studies, in which peptides as active pharmaceutical ingredient (API) or in the formulation matrix are stored at certain conditions, their chemical and physical stabilities are monitored over time to understand the effect of environmental parameters such as temperature, moisture and light. 3,4 There are reported computer programs such as ASAPprime 5 utilizing humidity corrected Arrhenius equation, which predicts the stability of APIs using storage conditions such as temperature and humidity. 3,5 Potency, individual degradation product and total degradation product are key quality attributes of peptide drug products and studied in chemical stability. 6 High performance liquid chromatography (HPLC) is one of the most common analytical techniques for measurement of potency and degradation product. Peptides, especially in solution during storage or manufacture, can undergo conformational change from α helical structure to β sheets, then aggregate or form fibrils. These pathways of physical instability are undesirable due to their potential risk of immunogenic responses and impacting bioavailability. 7,8 Thus, aggregation is a key physical stability parameter of formulated peptide product and need to be evaluated. One method for measuring the aggregation of peptides is through the use of Thioflavin T (ThioT). The knowledge gained from stability studies will be used to guide the design of formulation, establishment of shelf life and determination of proper storage and packaging conditions. Depending on the stage of drug development, the duration of stability study ranges from a couple of months to several years, which requires tremendous number of resources, and the process is lengthy.
Recently, chemists have started to apply machine learning techniques to model various chemical processes for catalyst design 9,10 , prediction of organic reactions 11,12 , reactions in a mass spectrometer 13 , and for the analysis of chemical spectra. Moreover, Lai and coworkers have developed machine learning based methodologies to determine molecular features which are responsible for the viscosity behavior and the aggregation of therapeutic antibodies. 14,15 In short, machine learning is a process where an algorithm is given input variables (x) and an output variable (y). This algorithm then adjusts a set of weights, biases, and other parameters to predict the output variable from the input variables. Given enough weights and an optimized algorithm, the result will be a model that can always predict the output from the inputs. Therefore, one needs to ensure that a given model is robust with high predictive performance on data that was not used to directly train the model. A popular technique for ensuring that this scenario is the case is referred to as k-fold cross validation where the available data is divided into various portions called folds. 16 For example, the available data can be divided into 5-folds each containing 20% of the entire data set. One of these folds is then removed and referred to as the validation set and the remaining training set is used to train the model. The model is then evaluated on the validation set and the performance of the model is calculated. This process is repeated until all the folds has been used as the validation set and statistics can be calculated on how well the model did for the various folds.
One interesting application of machine learning in chemistry is the ability to understand how underlying variables (x) relate to the output (y). The use of machine learning to understand these variables has been applied to understanding how IR and MS spectra relate to functional groups 17 and how functional groups in organic molecules relate to chemical reactivity 13 . When performing this type of analysis there are two important points to consider. First, machine learning models can only learn from the variables they are given, and second machine learning models can "overfit" to a particular data point set of variables 18,19 . This second point can be addressed by using cross-validation. Recent works have shown that machine learning models built on small training sets can be successful for both understanding chemistry and for useful predictions. 13,[20][21][22] To our knowledge, these techniques have not been used to understand the relationships between storage conditions, chemical stability, and physical stability in drug formulation. In this work, we will outline how an initial design of experiment (DoE) was used to generate training data for both the chemical and physical stability of a peptide drug molecule in formulation matrix.
Then, we will use the training data from this DoE to create machine learning models to predict the chemical stability from both the storage conditions and the physical stability to develop a relationship between these variables.

Stability study for Peptide A
Pharmaceutical development was triggered for Peptide A, sequence H{d}SQGTFTSDK(γEγEC16)SKYLDARAAQDFVQWLLDT-NH2, 23 necessitating the design of a robust and safe formulation with optimized chemical and physical stability. Early in development it was determined that Peptide A had risks in both its chemical and physical stabilities. The primary chemical degradation products were aspartic acid isomerization at multiple sites and reactivity at the N-terminal histidine. Fibrillation was identified as the primary mechanism of physical degradation. Optimizing the formulation for Peptide A was a balancing act due to the contrasting pH dependent trends necessary for chemical and physical stability. The chemical stability and key degradation products were observed to be accelerated under more alkaline conditions for the pH range evaluated (7.0-8.7). In contrast, the physical stability was inverted where Peptide A was stabilized by more alkaline conditions tending to form fibrils at neutral or acidic pH's.
We sought out to design an optimized formulation by setting up a stability study composed of Peptide A formulations spanning a wide range of pH's, buffers, and excipient concentrations. The stability study utilized both long term (5°C) and accelerated (25°C, 40°C) conditions with a target storage condition of being refrigerated. Given that pH was closely linked to chemical and physical stability formulations were designed to explore a wide pH range of 7.5-8.7 to define a formulation with an optimized pH. In addition to pH, compositions of multiple excipients were varied including histidine (0 to 10 mM) and a proprietary excipient B (EB) (0-0.5% w/v) which were targeted to improve chemical and physical stability, respectively. The addition of histidine was to prevent the growth of a specific degradant and is not expected to have a dramatic effect neither on the potency nor the total degradation of the peptide. Therefore, the addition of this excipient can be used as a case study for changes to the formulation that are not designed to make a dramatic impact on the output variables. The goal of the stability study was to optimize the pH and excipient compositions and determine if it was achievable to design a formulation with acceptable physical and chemical stability over the course of Peptide A's shelf life.
The values measured in the above DoE experiment are easily translated into features usable for a machine learning model. The pH, %EB, and output of the model are all numeric features of the model. While it would be possible to treat some of these values as one-hot encoded vectors, the true utility of the model would be to use for the optimization of these parameters. Since this ability is not possible using a one-hot encoded vector, we opted to treat these values numerically.
Additionally, the natural choice to treat the prediction of potency (measured using the percent label claim of the product, %LC) and Total Degradation (TD) as numeric values forces the creation of regression models. However, the buffer type used as well as the presence of histidine were treated as factors. The first choice is required as the use of a phosphate buffer versus a tris buffer cannot be modeled numerically. The second choice is also required as only concentration of histidine was used in the DoE experiment.
Interpretable machine learning for predicting potency and total degradation. Since both TD and %LC are represented to these models as a number, these models were treated as regression problems. We decided to use the coefficient of determination (R 2 ) to compare the performance of the models and select the best model from the data as this measure is easily understood by analytical chemists. More conventional measures of error such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) were calculated as well. The equations for these measures of error are given below where # is the experimentally measured observation and is the value predicted by a model. Additionally, $ is the mean of the experimentally measured values.

Model training, validation, and testing
To evaluate the performance of the model on an external test (blind) set, the data available for modeling was split into a training and test set. Assignment of an observation to either set was random and 80% of the data was set aside for the training set while 20% was used for the test set. Then, the training set was then split into a training and validation set as per the crossvalidation procedure and the above three measurements of model performance (R 2 , MAE, RMSE) are calculated for the validation set for a given set of hyperparameters (i.e. parameters used to control the training performance of model itself). The model and associated hyperparameters which performs the best (as measured by these metrics) is then selected and a new model is trained using the same method and hyperparameters using the entire training set. The test set identified at the beginning of the procedure was then used to evaluate this final model.

Results and Discussion
Total degradation is linear with respect to time point while label claim is not linear.
Before attempting to model the data, we examined how the total degradation (TD) and percent Interestingly, all five samples with a high R 2 were stored at the same pH and for these storage conditions, the %LC was less than or equal to 80% at the 24-month time-point. All other pH storage conditions had a %LC greater than 90%, therefore these 5 high R 2 values may be false indicator of linearity resulting from a strong outlier affect. While R 2 values alone cannot be used to determine whether a relationship is best modeled by a linear function, these results indicate that the TD measurement is more linear with respect to time than %LC. It should also be noted that some samples with low potency were noted to have a high degree of fibrillation, leading to samples with a high viscosity. This made the samples difficult to analyze and is not a good measurement of the true chemical degradation and problematic samples were removed from the available training data. In the next section, we explore different machine learning models for predicting these two measurements of chemical degradation.
Regression models predict total degradation.
The first machine learning models we investigate are interpretive models known to perform well  Figure 3a and Table S1 with additional metrics shown in Figures S1. It can be seen that SVM R , RF, and MLP models out-perform the other models and within error of each other. For these models, the performance of models trained on 50% of the available data (k=2) is similar to those trained on 95% of the available data (k=20). This indicates that these models can be trained with a small amount of data and still be predictive. Since these models all perform within error of each other, we compared the true TD to the predicted TD for each fold for the three models. The results for k=2 cross-validation is shown in Figures 3b-d and for all cross-validation schemes in Figures S2-S4. These results show that the MLP model is best at producing a one-to-one prediction with the true TD. The RF and SVM R models fail to predict large TD values (above 7.5%) while the MLP model is able to predict the TD for these values.
Therefore, we conclude that the MLP model performs best for this form of chemical degradation.
Given the success of the MLP model, we decided to investigate the internals of the model to attempt to gain insight. The entire model is displayed in the top of  Figure 4). 32 It is not surprising that the most important variable is the Time-point, followed by the pH (see the above discussion of Figure 2). While these results do not yield any insights into how these variables affect chemical degradation, they do show that this type of analysis can be used to reveal known relationships between the input variables and the output variables.
Results of the Train-Validate-Test paradigm are given in Table 2. The MAE of the MLP model is 0.421 for the test set and 0.267 for the full training set, indicating that a tuned version of this model performs well for data not available during training. This represents a mean percent error of 15.6% for the prediction of TD for time-points past the first month of the study given the average total degradation past this is 3.0%. Therefore, we would like to further optimize the prediction of TD using additional information, and idea which will be discussed in a later section of this work.
Regression models predict percent label claim.
The successful modeling of TD using storage conditions was repeated for the prediction of potency (quantitated as %LC). In Figure 5a it can be seen that the RF model achieved an R 2 greater than 0.90 and Figure S5 indicates that the RF model has the lowest RMSE and MAE. Additionally, the relationship between the true observations and predictions does not show any significant patterns (Figure 5b, Figure S6). For other interpretable machine learning models, points with true low %LC values are predicted to be too high (Figures S7-S8). These results indicate that the RF model is able to capture the relationship between the input variables and %LC. An analysis of the variable importance is given in Figure 5c. As with the prediction of TD, the most important variable for the prediction of %LC is the Timepoint of the data. The second most important variable is the type of buffer used to store the peptide. This result is interesting as the only pH condition associated with a change in buffer type is the pH 8.7 condition, indicating that the RF model is using this variable as a surrogate for the highest pH condition. Since the highest pH condition shows the greatest decrease in %LC, this selection is justified. The third most important variable is the pH, which likely explains the remaining variance in the %LC curve not already covered by the selection of buffer type. Finally, the EB is least important variable that is used by the RF model and likely adjusts for small changes in %LC. It is interesting to note that the amount of histidine is not used in the model to predict %LC and is assigned an importance of zero, similar to the result from the TD model.
As with the TD predictions, the Train-Validate-Test paradigm 16 was used to evaluate the performance of the tuned model, and the results of these tests is given in

Incorporating physical degradation as an input feature for chemical degradation
To improve the prediction results for the %LC and TD models, we decided to include the physical degradation of the peptide as measured by the ThioT curve measured at each time-point for the storage conditions (data shown in Figure 6 for the initial (6a), 1-month (6b), 6-month (6c) and  Table 4 for MLP, RF, and SVM models.
All three investigated models are able to predict the TD using solely the ThioT data as input. The functions are compared with surfaces generated by true data in Figure S9 and Figure S10 for TD and %LC, respectively. Therefore, we show that the learned functional surfaces have a very good correlation with true data. Moreover, the ML learned function by the RF model trained on both ThioT curve data and storage conditions in Figure S11 also shows similar relationship to the surfaces shown in Figure 8. A limitation of this work is that it does not determine how physical and chemical degradation are related on a molecular level. In future works, we will use molecular mechanics to investigation this relationship to develop detailed mechanistic models.

Conclusion
We does not appear to be important for modeling the chemical degradation of the peptide. We believe that these modeling efforts are useful for understanding the degradation of Peptide A as a function of its storage conditions. In conclusion, we provide a framework to develop the relationships between chemical and physical degradation for future studies in drug development and in chemical sciences.  Interpretative machine learning predicts total degradation of Peptide A as this appears to be a linear function with respect to time. Second, we will present a model to predict the change in percent label claim using changes in physical degradation as measured by a ThioT curve. Using this methodology, we will show that there is a link between the physical degradation of the peptide and the chemical degradation as represented by total degradation and percent label claim.  Table 1).

Figures and Tables
The slope of this trend is a function of pH, the type of buffer used, concentration of histidine, and the percentage of EB used. The relationship between time-point and the percent label claim is less obvious.     The ability of each model to predict %LC from the storage conditions is shown in (a). Of these models, only RF consistently exceeds an R 2 value greater than 0.90 for all cross-validation schemes. A plot of the predicted %LC value versus the true %LC value is shown in (b). These results show that the model is also able to predict %LC from storage conditions.