Theoretical and Computational Chemistry

Machine Learning the Redox Potentials of Phenazine Derivatives: A Comparative Study on Molecular Features

Siddharth Ghule, CSIR-National Chemical Laboratory (CSIR-NCL), Pune


Electricity generation is a major contributor to greenhouse gas emissions. Energy storage systems available today have a combined capacity to store less than 1% of the electricity consumed worldwide. Redox flow batteries (RFBs) are promising candidates for green and efficient energy storage. RFBs are already used in renewable energy systems, but their widespread adoption is limited by high production costs and the toxicity of transition-metal-based redox-active species. Cheaper and greener organic redox-active species are therefore being investigated. Recent reports have shown that phenazine-based organic molecules are promising candidates for redox-active species in RFBs. However, the large number of available organic compounds makes it impractical to screen thousands of molecules in a reasonable amount of time with conventional experimental and DFT methods. In contrast, machine-learning models have low development time, short prediction time, and high accuracy, and are therefore being heavily investigated for virtual screening applications. In this work, we developed machine-learning models to predict the redox potential of phenazine derivatives in DME solvent using a small dataset of 185 molecules. 2D, 3D, and molecular fingerprint features were computed using readily available and easy-to-use Python libraries, making our approach easily adaptable to similar work. Twenty linear and non-linear machine-learning models were investigated. These models achieved excellent performance on unseen data (i.e., R² > 0.98, MSE < 0.008 V², and MAE < 0.07 V). Model performance was assessed consistently using the training and evaluation pipeline developed in this work. We showed that 2D molecular features are the most informative and achieve the best prediction accuracy among the four feature sets.
We also showed that linear models, which are often less preferred but relatively fast, can perform better than non-linear models when the feature set contains different types of features (i.e., 2D, 3D, and molecular fingerprints). Further investigation revealed that training and inference time can be reduced without sacrificing prediction accuracy by using a small subset of features. Moreover, the models predicted the previously reported promising redox-active compounds with high accuracy. Prediction errors were also notably low across functional groups: although some functional groups had only one compound in the training set, the best-performing models achieved errors (MAPE) below 10%. The major source of error was a lack of data near zero and in the positive region. This work therefore shows that accurate machine-learning models, which could potentially screen millions of compounds in a short amount of time, can be developed with a small training set and a limited number of easy-to-compute features. The results reported here could thus help the adoption of green energy by accelerating materials discovery for energy storage applications.
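
The abstract reports model quality in terms of R², MSE, MAE, and MAPE. As a reading aid, the following is a minimal sketch of how these four metrics are conventionally defined, written in plain Python. The function name `regression_metrics` and the sample values are hypothetical illustrations, not the pipeline or data used in this work.

```python
from math import fsum


def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, MAE and MAPE (in %) for paired sequences of values.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error (units: V^2 when the inputs are in volts).
    mse = fsum(e * e for e in errors) / n
    # Mean absolute error (units: V).
    mae = fsum(abs(e) for e in errors) / n

    # Coefficient of determination: 1 - SS_res / SS_tot.
    # Undefined (division by zero) if all true values are identical.
    mean_true = fsum(y_true) / n
    ss_tot = fsum((t - mean_true) ** 2 for t in y_true)
    ss_res = fsum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot

    # Mean absolute percentage error; each term divides by the true value,
    # so it is undefined when any true value is exactly zero.
    mape = 100.0 * fsum(abs(e / t) for e, t in zip(errors, y_true)) / n

    return {"r2": r2, "mse": mse, "mae": mae, "mape": mape}


# Made-up redox potentials in volts (illustrative only, not data from this work).
y_true = [-1.50, -1.20, -0.90, -0.60]
y_pred = [-1.45, -1.25, -0.85, -0.65]
metrics = regression_metrics(y_true, y_pred)
```

Because MAPE divides each error by the true value, the same absolute error counts for more as the true value gets smaller in magnitude, which is a general property of the metric rather than a finding of this study.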

Version notes

Draft-8 of the original manuscript


MANUSCRIPT (draft-8, redox_potential).pdf (2 MB)

Supplementary material

SI (redox_potential).pdf (1 MB)
SI (redox potential)