Knowledge Discovery from Porous Organic Cages Literature Using a Large Language Model

23 October 2024, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Porous organic cages (POCs) are an emerging subclass of porous materials that is drawing increasing attention because of its structural tunability, modularity and processability, and research in this area is expanding rapidly. Gathering sufficient information from the extensive literature on organic molecular cages is, however, time-consuming and labour-intensive. This article presents a GPT-4-based literature reading method that combines multi-label text classification with a follow-up information extraction step, exploiting GPT-4 to rapidly extract valid information from the literature. In the multi-label text classification step, the prompt-engineered GPT-4 labelled text, with adequate recall rates, according to the type of information it contains, including authors, affiliations, synthetic procedures, surface area, and the CCDC number of the corresponding cages. GPT-4 was also proficient at information extraction, transforming labelled text into concise tabulated data. Finally, we built a chatbot on top of the resulting database that enables quick, comprehensive searching across the entire database and answers cage-related questions.
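To make the recall-based evaluation of the multi-label classification step concrete, the following is a minimal sketch and not the authors' code. It assumes the model replies with a comma-separated list of labels; the label names, the reply format, and the helper names `parse_labels` and `per_label_recall` are hypothetical, and the sample replies are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical label set, loosely based on the information types named in the abstract.
LABELS = ["authors", "affiliations", "synthesis", "surface_area", "ccdc_number"]


def parse_labels(model_reply: str) -> set[str]:
    """Parse an assumed comma-separated reply into a set of known labels."""
    found = {part.strip().lower() for part in model_reply.split(",")}
    # Discard anything that is not in the label set.
    return {label for label in found if label in LABELS}


def per_label_recall(predicted: list[set[str]], reference: list[set[str]]) -> dict[str, float]:
    """Recall per label: correctly predicted occurrences / reference occurrences."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, ref in zip(predicted, reference):
        for label in ref:
            totals[label] += 1
            if label in pred:
                hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Invented example replies (one per paragraph) and manually assigned reference labels.
replies = ["synthesis, surface_area", "authors", "ccdc_number, unknown"]
reference = [{"synthesis", "surface_area"}, {"authors", "affiliations"}, {"ccdc_number"}]

predicted = [parse_labels(reply) for reply in replies]
recall = per_label_recall(predicted, reference)
for label, value in sorted(recall.items()):
    print(f"{label}: {value:.2f}")
```

Because a paragraph can carry several labels at once, recall is computed per label over all paragraphs rather than per paragraph.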

Keywords

Porous organic cages
Large language models
ChatGPT
Prompt engineering

Supplementary materials

Title: Prompts, source code and detailed test results

Description: The document is organized into several sections:
General Information: mentions the use of GPT-4o for constructing interactive chatbots and provides a URL for raw data and source code.
Prompt Engineering: details the prompts used for text categorization into nine categories, such as author information, synthesis details, CCDC numbers, surface area, affiliations and references, and provides examples of paragraphs with their corresponding categories.
Python Code: includes Python code for paragraph segmentation, text categorization, data extraction, chatbot creation, and calculation of the similarity between GPT-4's answers and reference answers.
Detailed Experimental Data: presents data on text categorization accuracy and on the similarity of information extracted by GPT-4 to manually extracted information. A table shows the similarity metrics for various categories.
References: lists references for further reading, including sources on large language models and research related to porous organic cages.
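The similarity calculation between GPT-4's answers and reference answers mentioned in the description could look roughly like the sketch below. The metric used in the supplementary material is not stated here, so this example swaps in the character-level ratio from Python's standard `difflib.SequenceMatcher` as an assumption; the helper names and the sample rows are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(extracted: str, reference: str) -> float:
    """Similarity in [0, 1] between an extracted answer and a reference answer.

    Assumption: difflib's ratio stands in for whatever metric the authors used.
    """
    return SequenceMatcher(None, normalise(extracted), normalise(reference)).ratio()


def mean_similarity_by_category(rows: list[tuple[str, str, str]]) -> dict[str, float]:
    """Average the similarity over (category, extracted, reference) rows per category."""
    scores = defaultdict(list)
    for category, extracted, reference in rows:
        scores[category].append(similarity(extracted, reference))
    return {category: sum(values) / len(values) for category, values in scores.items()}


# Placeholder rows for illustration only; these are not results from the paper.
rows = [
    ("ccdc_number", "CCDC 0000000", "ccdc  0000000"),
    ("surface_area", "abc", "xyz"),
]
print(mean_similarity_by_category(rows))
```

Averaging per category mirrors the table of per-category similarity metrics that the description refers to.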
