Using GPT-4 in Parameter Selection of Materials Informatics: Improving Predictive Accuracy Amidst Data Scarcity and 'Ugly Duckling' Dilemma

30 May 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


Materials informatics and cheminformatics struggle with data scarcity, which hinders the extraction of significant relationships between structures and properties. The "Ugly Duckling" theorem, which implies that data cannot be meaningfully processed without assumptions or prior knowledge, exacerbates this problem. Current methodologies do not entirely circumvent this theorem and may lose accuracy on unfamiliar data. We propose using OpenAI's GPT-4 language model for explanatory variable selection, leveraging its extensive knowledge and logical reasoning capabilities to embed domain knowledge in tasks that predict structure-property correlations, such as the refractive index of polymers. This approach can partially overcome the challenges posed by the "Ugly Duckling" theorem and limited data availability.
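As a rough illustration of what language-model-assisted explanatory variable selection could look like in practice, the sketch below builds a selection prompt and then validates a model's reply against the known candidate list. It is not the method of this preprint: the function names, the prompt wording, and the descriptor names (`molar_refractivity`, `num_aromatic_rings`, and so on) are all hypothetical, and the call to the language model itself is deliberately left out, so the reply is a hand-written stand-in.

```python
import re
from typing import List


def build_selection_prompt(target: str, candidates: List[str], k: int) -> str:
    """Compose a prompt asking a language model to pick k explanatory variables."""
    listing = "\n".join(f"- {name}" for name in candidates)
    return (
        "You are assisting with a materials informatics task.\n"
        f"Target property: {target}\n"
        f"Candidate explanatory variables:\n{listing}\n"
        f"Select the {k} variables most relevant to the target, "
        "one per line, using the exact names above."
    )


def parse_selection(reply: str, candidates: List[str], k: int) -> List[str]:
    """Keep only reply lines that exactly match a known candidate.

    Names are returned in reply order, without duplicates, and capped at k.
    Anything the model invented is silently dropped.
    """
    allowed = set(candidates)
    chosen: List[str] = []
    for line in reply.splitlines():
        # Strip a leading bullet ("-", "*") or numbering ("1.", "2)").
        name = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if name in allowed and name not in chosen:
            chosen.append(name)
    return chosen[:k]


# Hypothetical descriptor names, for illustration only.
candidates = [
    "molar_refractivity",
    "molecular_weight",
    "num_aromatic_rings",
    "num_rotatable_bonds",
]
prompt = build_selection_prompt("refractive index of polymers", candidates, k=2)

# Hand-written stand-in for a model reply (no model is called here).
reply = "1. molar_refractivity\n- num_aromatic_rings\nsomething_made_up\n"
selected = parse_selection(reply, candidates, k=2)
```

Filtering the reply against the candidate list means a downstream regression model only ever sees variables that actually exist in the dataset, whatever the language model returns.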


Materials Informatics
Large language model


