Machine learning prediction of the most intense peak of the absorption spectra of organic molecules

03 December 2024, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Accurate knowledge of the electronic properties of molecular excited states is fundamental for understanding the behavior of functional materials for organic electronics and sensors. In this work, we focus on determining the properties of the most intense peak in the electronic absorption spectra of organic molecules. For this purpose, we employed the quantum chemistry QM-symex dataset, which contains approximately 173,000 organic molecules, each identified by its Cartesian coordinates, together with time-dependent DFT (TD-DFT) data for the first ten electronic absorption transitions. From the data in the original QM-symex, we built a new dataset named QM-symex-modif that contains the molecules in Simplified Molecular Input Line Entry System (SMILES) format and the properties related to the main electronic transition. We then employed twenty machine learning (ML) algorithms to predict oscillator strengths, excitation energies, transition orbitals, and the highest occupied molecular orbitals (HOMOs). As inputs for the ML algorithms, we used several chemical descriptors for each molecule, generated with the RDKit tool from the corresponding SMILES string. The generated input descriptors significantly improved the accuracy of the ML predictions for these key photophysical properties. Low mean absolute errors (MAEs) were obtained for the test set of 45,056 molecules: 0.035 for oscillator strengths, 0.09 eV for excitation energies, 1.24 and 0.62 for the initial and final transition molecular orbital (MO) numbers, respectively (i.e., for each molecule, their position in the MO listing), and 0.014 for HOMO numbers. The R² values consistently exceeded 0.94, demonstrating the accuracy of the models. Additionally, a Shapley additive explanation (SHAP) analysis was carried out to evaluate the importance of the input parameters for the investigated ML models. The analysis revealed several relationships involving the input parameters; in particular, molecular weight holds significant importance in our ML models for determining the target HOMO numbers and the transition orbitals.
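
The two accuracy metrics quoted above, MAE and R², can be illustrated with a minimal sketch. The function names and the four excitation energies below are hypothetical values chosen for illustration only; they are not taken from QM-symex or from the authors' models, and the preprint does not state which implementation was used to compute the metrics.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all reference values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical excitation energies in eV (illustrative only, not from the dataset).
reference = [3.10, 4.25, 2.80, 5.00]
predicted = [3.00, 4.35, 2.90, 4.90]

mae = mean_absolute_error(reference, predicted)
r2 = r2_score(reference, predicted)
print(f"MAE = {mae:.3f} eV, R² = {r2:.3f}")
```

MAE is expressed in the units of the target (eV for excitation energies, dimensionless for oscillator strengths and orbital numbers), whereas R² is unitless, which is why the abstract reports both.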

Keywords

Machine Learning
Organic Electronics
Sensors
Absorption Maximum Peak
Excited state properties
QM-symex dataset
Simplified Molecular Input Line Entry System (SMILES) Format

Supplementary materials

Supporting Information
Supporting information discussed in the text.
