Selecting the appropriate features in battery lifetime predictions

SUMMARY
Data-driven models are being developed to predict battery lifetime because of their ability to capture complex aging phenomena. In this perspective, we demonstrate that it is critical to consider the use cases when developing prediction models. Specifically, model features need to be classified to differentiate whether or not they encode cycling conditions, which are sometimes used to artificially increase the diversity in battery lifetime. Many use cases require the prediction of cell-to-cell variability between identically cycled cells, such as production quality control. Developing models for such prediction tasks thus requires features that do not rely on cycling conditions. Using the dataset published by Severson et al. in 2019 as an example, we show that features encoding cycling conditions boost model accuracy because they predict the protocol-to-protocol variability. However, models based on these features are less transferable when deployed on identically cycled cells. Our analysis underscores the concept of using the right features for the right prediction task. We encourage researchers to consider the usage scenarios that they are developing models for and whether or not to include cycling conditions in their models in order to avoid data leakage. Equally important, benchmarking model performance should be carried out between models developed for the same use case.


CONTEXT & SCALE
Machine learning plays a critical role in accelerating materials research and optimizing battery performance. In recent years, scientists and engineers have developed accurate data-driven models to predict the lifetime of Li-ion batteries. The versatility of such models and their ability to shorten battery testing have drawn substantial attention from industry and academia. Feature engineering has been instrumental in achieving promising model performance. In this perspective paper, we encourage researchers to determine whether they should include features that encode cycling conditions depending on their use case.
In model training datasets, battery cell aging is related to both intrinsic variability between cells and induced variability between cycling conditions. We demonstrate that for certain applications, such as production quality control, models should only use features that capture cell-to-cell variability. In these cases, features that encode information on cycling conditions should be avoided. Through several analyses, we show that when this constraint is not upheld, model performance is artificially inflated. Using the right features for the right task is essential to building data-driven models suitable for real use cases.

INTRODUCTION
Optimizing rechargeable batteries is a challenging and necessary task as energy storage is deployed to decarbonize transportation and the electricity grid. The battery design space is large, spanning chemistry, architecture, manufacturing processes, and usage conditions. Moreover, battery lifetime is long, and its evaluation is time- and resource-consuming, generally lasting several months to years, even under accelerated testing conditions. Over the past few decades, there have been substantial efforts to shorten battery lifetime evaluation, such as developing accelerated testing protocols (typically under elevated temperatures) [1,2] and electrochemical methods (such as high-precision coulometry) [3]. Physics-based models [4-7] and semiempirical models [8,9] have also been widely used to capture battery degradation trajectories. More recently, substantial progress has been made in data-driven approaches for battery lifetime prediction, typically involving machine learning (ML) methods on large datasets [10,11]. There are many use cases for battery lifetime prediction ML models, and the model development is specific to each use case. In Figure 1, we provide an overview of several typical use cases and classify them based on whether the cycling conditions are varied when carrying out the cycling experiments. For example, a battery engineer who aims to optimize the cell design will cycle batteries identically for a fair comparison (Figure 1 scenario a) [12,13]. Alternatively, a model sorting short-lasting batteries out of a production line needs to rely on minimal testing as batteries cannot be cycled more than a few times (Figure 1 scenario b) [14]. In another scenario, a battery engineer aims to determine the impact of usage conditions, such as the depth of discharge and the charging current, on battery lifetime [1,15,16]. For such a use case, an experiment with intentionally varied aging conditions is needed to build a prognosis model (Figure 1 scenario c). Such a model could be further integrated into more complex architectures to optimize over a large protocol parameter space (Figure 1 scenario d) [17,18]. In another use case, an electric vehicle (EV) engineer aims to integrate a prediction model in the battery management system (BMS) to estimate the battery state of health (SOH) [10,19,20]. Contrary to the previous scenario, cells are cycled variably during driving, and the BMS can access historical data for predictions (Figure 1 scenario e). Lastly, for battery repurposing, models do not have access to historical cycling data (Figure 1 scenario f) [21]. These use cases are diverse and involve cycling conditions that are either constant or variable.
Differences in battery aging trajectories are due to both cell-to-cell variability and differences in external aging factors, such as cycling rate and temperature. 3][24][25][26] On the other hand, varying the cycling conditions is an artificial yet convenient way to enhance the diversity of the dataset and decrease the number of cells that need to be tested. This approach of diversifying the dataset via varied cycling conditions is analogous to accelerating aging via elevated temperature cycling, with the goal being to decrease the resources (number of cells and time) required.
In 2019, Severson et al. [11] demonstrated that ML, combined with a large dataset, is effective for predicting battery lifetime by employing data-driven feature engineering. The authors achieved accurate early lifetime predictions for 124 commercial lithium iron phosphate (LFP)/graphite cells using observations from the first 100 aging cycles. The cells were charged under different fast-charging protocols, but the discharge protocol was identical for all cells. Thus, the charging data explicitly encoded the cycling conditions, whereas the discharging data (for a given cycle) did not. ][29][30][31] Features inspired by Severson's work have been used to create models that can be transferred to different datasets [32-34] or different chemistries [35]. 9][40] These complex architectures can be used to improve lifetime prediction accuracy [41] or to enable predictions with fewer aging cycles [42]. Both of these objectives have been used as metrics to benchmark model performance. Although ML approaches for battery lifetime predictions are powerful, they also have pitfalls [43]. Among them are "nonlegitimate features" [44], which are those linked or correlated with the outcome. In the ML community, the term "data leakage" is used when information about the target is contained in the training data but may not be accessible in real testing conditions. This results in the model performance being artificially inflated during training but then deteriorating when the trained model is deployed for real use cases.
In this perspective, we analyze the Severson dataset to demonstrate that the community needs to pay attention to which features to use, depending on the usage scenario. Using features developed by Greenbank and Howey [37] extracted on (1) the charging data, (2) the discharging data, and (3) the entire charge-discharge data, we show that models relying on features based on the charging data have substantially better prediction accuracies than models using features from the discharging data. This is because the former models directly capture protocol-to-protocol variations. In fact, we show that models using no aging data at all and only the cycling conditions give good performance. However, when deployed on identically cycled cells, models based on the charging data do not maintain the same level of prediction accuracy. More generally, we show that feeding information about the aging conditions into a lifetime prediction ML model will bias the model to learn the protocol-to-protocol variations instead of learning the cell-to-cell variations. Thus, for use cases aiming to detect cell-to-cell variability, features encoding cycling conditions are inappropriate and should be avoided to prevent data leakage.

METHODOLOGY
In the Severson dataset, the authors artificially amplified the variability between cells by changing the charging protocols (two constant current [CC] steps varied across cells, followed by a CC, constant voltage [CCCV] step). As a result, the cycle life (defined as the cycle number when the capacity reaches 80% of nominal value) varied between 148 and 2,237 cycles. The discharge conditions, on the other hand, were kept constant for all cells, providing a common diagnostic across cycling conditions. In Figure 2, we define a feature classification scheme, specific to this dataset, reflecting how these features are derived:
"Class 0" features: solely based on cycling protocol parameters (no actual battery cycling data)
"Class 1" features: derived from the charging data during aging, which encode charging protocols
"Class 2" features: derived from the discharging data during aging
This classification scheme is generalizable to other battery datasets. Specifically, class 1 features are those that encode aging conditions either through cycling or through calendar aging [16]. Class 2 features rely solely on regions of the cycling curves that are kept constant across all cells. This can be a diagnostic or check-up cycle, performed at regularly spaced intervals, to probe the state of degradation [16,25,45]. Including a reset cycle at the start of such check-up cycles is essential to erase any explicit information about aging conditions. In the Severson dataset, there is no reset cycle between the charge and the discharge, but the CCCV at the end of charge is kept constant across all cells and mitigates explicit data leakage to the discharge data.
The purpose of this perspective is to assess the importance of the feature classes for lifetime predictions. Thus, we need features that are derived equivalently across charging and discharging curves (e.g., using the same feature extraction routine). For this reason, we employ features developed by Greenbank and Howey [37]. Briefly, these features quantify the fraction of time a given time series of interest (voltage, current, etc.) spends in a given window (e.g., 3.5-3.6 V) during a given time interval (a 10-cycle window in our study). See supplemental information for a complete explanation of the featurization. The code used for this analysis is publicly available on GitHub at https://github.com/geslina. For simplicity, models using features based on the charging data, the discharging data, or the full cycling data are referred to as "charge" (class 1), "discharge" (class 2), or "full" (class 1) models, respectively. Finally, we derive the features at different points in the early cycles of the cells. This allows us to study how the information carried by such features evolves during the early aging of the cells.
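The time-in-window idea described above can be illustrated with a minimal sketch. It is not the featurization code used in this study: the function names, the zero-order-hold assumption (each sample is taken to persist until the next timestamp), and the toy voltage trace are all hypothetical choices made for illustration, and the sketch handles a single time series rather than a 10-cycle window.

```python
def fraction_of_time_in_window(times, values, low, high):
    """Fraction of elapsed time that `values` spends in [low, high).

    Assumption (illustrative only): each sample holds until the next
    timestamp, so the last sample contributes no duration.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need two or more samples of equal length")
    total = times[-1] - times[0]
    if total <= 0:
        raise ValueError("timestamps must span a positive duration")
    in_window = 0.0
    for i in range(len(times) - 1):
        if low <= values[i] < high:
            in_window += times[i + 1] - times[i]
    return in_window / total


def window_features(times, values, edges):
    """One feature per consecutive pair of window edges."""
    return [
        fraction_of_time_in_window(times, values, lo, hi)
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


# Hypothetical voltage trace (seconds, volts), not taken from any dataset.
times = [0, 10, 20, 30, 40]
voltage = [3.40, 3.52, 3.58, 3.61, 3.65]
features = window_features(times, voltage, [3.4, 3.5, 3.6, 3.7])
print(features)
```

Because the same routine can be applied unchanged to the charging segment or the discharging segment of a cycle, it yields features that are derived equivalently for class 1 and class 2, which is the property the comparison requires.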
The ML task is to predict the battery lifetime and to compare the impact of incorporating features encoding cycling conditions directly or indirectly (class 0 and class 1). We employed a regularized linear model (elastic net) and an ensemble model (random forest regressor), the latter being able to capture nonlinear correlations. Most of the discussion is based on the results from the random forest model, which provides better accuracy. To prevent overfitting and to optimize model hyperparameters, a 10-fold cross-validation was carried out systematically using the GridSearchCV class from the Python package scikit-learn [46] (see supplemental information for details). An 80-20 train-test split was used. Because we observe that the prediction accuracy depends on the train-test split, all of the analyses conducted here are repeated over 10 random train-test splits. To evaluate the usefulness of an ML approach, we employed a "dummy regressor" from the scikit-learn library [46] as a baseline. For a given training set, such a model always predicts the same output value (in this study, the mean of the training set labels) regardless of the input features.
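The evaluation loop described above can be sketched with scikit-learn as follows. This is a sketch under stated assumptions, not the study's code: the hyperparameter grid, the number of trees, the helper name `evaluate_over_splits`, and the synthetic data are hypothetical, and only the random forest and the dummy baseline are shown (the elastic net is omitted).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, train_test_split


def evaluate_over_splits(X, y, n_splits=10, seed=0):
    """Mean and SD of test MAPE (%) over repeated 80-20 train-test splits."""
    param_grid = {"max_depth": [3, None]}  # hypothetical grid, for illustration
    model_err, dummy_err = [], []
    for split in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed + split
        )
        # 10-fold cross-validation on the training set selects hyperparameters.
        search = GridSearchCV(
            RandomForestRegressor(n_estimators=20, random_state=0),
            param_grid,
            cv=10,
            scoring="neg_mean_absolute_percentage_error",
        )
        search.fit(X_tr, y_tr)
        # Baseline: always predicts the mean of the training labels.
        dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
        model_err.append(100 * mean_absolute_percentage_error(y_te, search.predict(X_te)))
        dummy_err.append(100 * mean_absolute_percentage_error(y_te, dummy.predict(X_te)))
    return {
        "model": (float(np.mean(model_err)), float(np.std(model_err))),
        "dummy": (float(np.mean(dummy_err)), float(np.std(dummy_err))),
    }


# Synthetic stand-in data: 120 "cells", 4 features, lifetime driven by feature 0.
rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))
y = 500 + 1000 * X[:, 0] + rng.normal(scale=50, size=120)
results = evaluate_over_splits(X, y, n_splits=3)  # 3 splits to keep the demo fast
print(results)
```

Reporting the dummy regressor alongside the model on the same splits makes it visible how much of the accuracy comes from the features rather than from the lifetime distribution of the particular split.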

RESULTS AND DISCUSSION
Class 0 model is a decent baseline model without the need for any aging data
Using solely the cycling protocols and no battery aging data (i.e., only class 0 features), we show good but artificially inflated "early-prediction" model performance, compared with the dummy regressor. We stress that this is not an early-prediction model but a model that predicts cycle life as a function of cycling conditions. Table 1 reports the mean absolute percentage error (MAPE) for both the linear model and the random forest model. These results show that by solely using cycling conditions and no aging data, lifetime can be predicted with 26.4% error, a significant improvement over the 41.2% error of the dummy regressor. Importantly, this confirms that cycling conditions are predictive, even without aging data, as should be expected [15,47].

Class 1 features carry more information than class 2 features
Next, we compare ML performance using class 1 and class 2 features (Figure 3) as a function of the number of early cycles used as inputs. The errors of the charge and full models are equivalent and substantially lower than those of the discharge model. Importantly, the charge models perform equally well with features from cycle 1 alone as with features from cycle 150. The low prediction errors of these models are explained by the fact that the charge features encode both the protocol-to-protocol and the cell-to-cell variability. This makes the charge models even more accurate than the class 0 model. By contrast, the discharge models do not rely on any cycling conditions information. Thus, their predictive power is limited if only the first tens of cycles are employed for the featurization, as shown in Figure 3. However, their accuracy improves steadily with increasing early aging data, approaching the other two models' accuracy after more than 100 cycles, consistent with Severson et al. [11]. Additionally, Figure S4 demonstrates that all models' errors are approximately equivalent if cycling information is manually included in the features. These results confirm that a model that includes features that encode the intentionally varied cycling conditions will have an artificially inflated early-prediction accuracy. By contrast, predicting lifetime without knowledge of the cycling conditions is a considerably more challenging task.
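For reference, the error metric and baseline used in this section can be written out explicitly. The block below assumes the standard definition of MAPE and the mean-predicting dummy regressor described in the methodology; the symbols $y_i$ (observed cycle life of test cell $i$), $\hat{y}_i$ (predicted cycle life), $n$ (number of test cells), and $m$ (number of training cells) are notation introduced here for illustration.

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,
\qquad
\hat{y}_i^{\,\mathrm{dummy}} = \frac{1}{m} \sum_{j=1}^{m} y_j^{\mathrm{train}}
```

Under this definition, the 26.4% and 41.2% values quoted above are average relative deviations from the observed cycle life.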
Class 2 models are more appropriate to study cell-to-cell variability
Finally, we compare charge and discharge models when deployed on a subset of cells from the dataset that were cycled identically. This special case examines how we can predict the lifetime of cells in which variation arises purely from intrinsic differences between cells, with one use case being production quality control. Although the Severson dataset does not contain many protocols with a large number of cell repeats, there are 5 protocols with at least 6 repeats (Figure S5). For each of these protocols, we train and test a charge model and a discharge model on the entire dataset, excluding the identically cycled cells of that protocol. We then deploy these models on the retained cells and compare the models' errors (Table 2). Protocols "5.0C-(67%)-4.0C" and "5.3C-(54%)-4.0C" (rows 1 and 4 in Table 2) show that the discharge model performs similarly to, if not better than, the charge model on identically cycled cells. This trend is the opposite of what is observed for models trained on the full dataset, which includes cells from many charging conditions. Class 1 features are less capable of capturing intrinsic cell-to-cell variability than class 2 features because their predictive power relies strongly on the charging protocol encoded in the data (which is not varied in this test subset of the data). Additionally, both models deployed on cells from protocols "5.6C-(19%)-4.6C" and "5.6C-(36%)-4.3C" (rows 2 and 3 in Table 2) show a significant error reduction, irrespective of the model. As mentioned earlier, the test performance is highly dependent on the train-test split. These two protocols both have a narrow distribution and a mean close to that of the entire dataset (802 cycles), as shown in the parity plots in Figure S6. Such protocols may lead to "easier" predictions because these cells are similar to the train set cells. Such protocols provide little information regarding the ability of a model to extrapolate to datapoints outside of the train set. By contrast, in the case of protocol "5.0C-(67%)-4.0C" (row 1), which has a higher spread in cycle life distribution, a model relying on the cell-to-cell variability ("discharge model") would perform better at extrapolating. Lastly, protocol "4.8C-(80%)-4.8C" (row 5 in Table 2) has the largest number of repeats but also contains 3 cells with a cycle life higher than 1,600 (more than two standard deviations [SDs] away from the mean). The errors of these models reported in Table 2 are dominated by the models' ability to fit outliers (Figure S6). Thus, these outliers make the performance results for that protocol less representative.
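The hold-out procedure described above (develop the models on all cells except those of one protocol, then deploy on the identically cycled cells that were set aside) can be sketched as follows. The helper names, cell identifiers, and protocol labels are hypothetical placeholders; they are not the labels or cell counts of the Severson dataset.

```python
from collections import defaultdict


def protocols_with_repeats(protocol_by_cell, min_repeats=6):
    """Map each protocol with at least `min_repeats` cells to its sorted cell IDs."""
    groups = defaultdict(list)
    for cell, protocol in protocol_by_cell.items():
        groups[protocol].append(cell)
    return {p: sorted(c) for p, c in groups.items() if len(c) >= min_repeats}


def hold_out_protocol(protocol_by_cell, protocol):
    """Split cells into a development set and the held-out, identically cycled set."""
    held_out = sorted(c for c, p in protocol_by_cell.items() if p == protocol)
    development = sorted(c for c, p in protocol_by_cell.items() if p != protocol)
    return development, held_out


# Hypothetical toy assignment: 6 cells on protocol A, 2 on B, 7 on C.
protocol_by_cell = {f"cell_{i:02d}": "protocol_A" for i in range(6)}
protocol_by_cell.update({f"cell_{i:02d}": "protocol_B" for i in range(6, 8)})
protocol_by_cell.update({f"cell_{i:02d}": "protocol_C" for i in range(8, 15)})

eligible = protocols_with_repeats(protocol_by_cell)
for protocol in sorted(eligible):
    development, held_out = hold_out_protocol(protocol_by_cell, protocol)
    # Train and test on `development` (e.g., with repeated 80-20 splits),
    # then deploy the fitted model on `held_out`.
    print(protocol, len(development), len(held_out))
```

Because every held-out cell shares one protocol, any feature that only encodes the protocol is constant across the deployment set and cannot explain the spread in its cycle life.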
We note that the Severson dataset is not optimal to study cell-to-cell variability. First, the protocol-to-protocol variability is more pronounced than the cell-to-cell variability because of the aggressive cycling conditions (Figure S4). Second, the dataset does not contain many protocols with a large number of repeats (Figure S5). Dechent and coworkers [48,49] developed statistical methods to estimate the number of cell repeats needed to capture the underlying cell-to-cell variability of a population. They showed that, even for simple models with 3 parameters or fewer, at least 9-13 repeats are needed to reliably fit the data. The ML models used here rely on many more parameters, and the Severson dataset does not contain more than 9 repeats for any protocol. This emphasizes the need to design datasets tailored for specific applications. For example, to study cell-to-cell variability, the datasets in Baumhöfer et al. [14], Rumpf et al. [22], Dechent et al. [23,24], and Weng et al. [25] are more appropriate.
Analysis and results are reproducible using different sets of features
For completeness, we repeated our analysis using features originally derived by Severson et al. [11]. Only 7 of the 20 original features were employed; not all features could be derived identically on the charge and discharge data because the charging protocol has three steps, whereas the discharge protocol has only one. Figure S7 shows model performance trends similar to those observed in Figure 3 (which uses features developed by Greenbank and Howey [37]). Errors are noticeably higher compared with Figure 3 (+5% in error for the best-performing models), which is expected because the features are not optimized. This analysis confirms that cycling conditions information can bias models, independently of the featurization method.

CONCLUSIONS AND RECOMMENDATIONS
In recent years, the battery research community has deployed data-driven methods to predict battery lifetime. The dataset published by Severson et al. [11], among several others, is widely used for benchmarking model performance. In this dataset, charging conditions are varied to broaden the distribution of cycle life, whereas discharging conditions are kept constant across the cells. We demonstrate that a prediction model using only cycling conditions and no aging data can achieve decent (26.4% MAPE) predictions (compared with a dummy regressor, 41.2% MAPE) because cycling conditions strongly influence battery lifetime. More importantly, we show that models with features encoding the cycling conditions are more accurate than models that do not rely on cycling conditions. However, these models do not maintain the same level of performance when predicting intrinsic cell-to-cell variability among subsets of cells cycled identically.
Our results illustrate that ML models for lifetime prediction can be biased to learn the intentionally induced protocol-to-protocol variations in a dataset. In some use scenarios, such as prognosis predictions and cycling protocol optimization, this is advantageous. However, in other scenarios, such as production quality control and chemistry/cell design optimization, the objective is to detect the intrinsic variability between the cells rather than how cycling conditions determine the lifetime. In such use cases, battery cells are cycled identically, making the prediction model task harder, as we show here.
We recommend that researchers carefully consider the use cases when developing lifetime prediction models and select the right features for the right prediction tasks to avoid data leakage. Importantly, benchmarking of lifetime prediction models should be carried out for models designed for the same use cases. Comparing model performance across use cases would be unfair, as some models can rely on richer data encoding the protocol-to-protocol variability, as demonstrated in this perspective.

Figure 1. Machine learning algorithms for battery research can be deployed for various use cases
This flowchart illustrates the correspondence between available data and use cases.

Figure 2. Voltage and current versus time profiles from one full charge and discharge cycle in the Severson dataset
Marked are the temporal regions from which class 1 and class 2 features are generated.

Figure 3. Evolution of the accuracies of random forest models based on how much aging data are included as inputs to the models
A class 0 model relies solely on cycling conditions and thus does not need any aging data. The shaded areas represent the values within 1 SD, calculated across the results of the 10 train-test splits.

Table 1. Train and test mean absolute percentage errors (MAPEs) for the baseline models

Table 2. Errors when models are deployed on identically cycled cells
Severson dataset'' contain at least 6 repeats. The mean MAPE and SD are calculated across 10 train-test splits. The overall average lifetime in the entire dataset is 802 cycles with an SD of 378. Class 2 features are more appropriate to capture cell-to-cell variability.