Abstract
Porous organic cages (POCs) are an emerging subclass of porous materials that are attracting increasing attention for their structural tunability, modularity and processability, and research in this area is expanding rapidly. Nevertheless, extracting sufficient information from the extensive literature on organic molecular cages is a time-consuming and labour-intensive process. This article presents a GPT-4-based literature-reading workflow that combines multi-label text classification with follow-up information extraction, allowing the potential of GPT-4 to be fully exploited for rapidly extracting valid information from the literature. In the multi-label classification step, the prompt-engineered GPT-4 labelled text with satisfactory recall according to the type of information it contained, including authors, affiliations, synthetic procedures, surface areas, and the CCDC numbers of the corresponding cages. GPT-4 also proved proficient at information extraction, effectively transforming the labelled text into concise tabulated data. Finally, we built a chatbot on top of the resulting database, enabling quick and comprehensive searches across the entire database and answers to cage-related questions.
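To illustrate the classification step described in the abstract, the following minimal Python sketch asks GPT-4 to assign multi-label categories to a single paragraph. The model name, category list, and prompt wording here are illustrative assumptions rather than the exact script used in this work; the released code is linked under the supplementary materials below.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Illustrative label set; the paper's full nine-category scheme is given
    # in the supplementary prompt-engineering section.
    CATEGORIES = ["author", "affiliation", "synthetic procedure",
                  "surface area", "CCDC number", "reference", "other"]

    def classify_paragraph(paragraph: str) -> list[str]:
        """Ask GPT-4 which categories apply to one paragraph (multi-label)."""
        prompt = (
            "Label the paragraph below with every category that applies, "
            f"returned as a comma-separated list: {', '.join(CATEGORIES)}.\n\n"
            f"Paragraph:\n{paragraph}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = response.choices[0].message.content
        # Simple parse of the comma-separated reply back into known labels.
        return [label.strip() for label in answer.split(",")
                if label.strip() in CATEGORIES]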
Supplementary materials
Title
Prompts, source code and detailed test results
Description
The document is organized into several sections:
General Information: It mentions the use of GPT-4o for constructing interactive chatbots and provides a URL for the raw data and source code.
Prompt Engineering: This section details the prompts used to categorize text into nine categories, including author information, synthesis details, CCDC numbers, surface area, affiliations, and references. It also provides example paragraphs with their corresponding categories (a hypothetical prompt skeleton is sketched after this list).
Python Code: This section includes Python code for paragraph segmentation, text categorization, data extraction, chatbot creation, and calculation of the similarity between GPT-4's answers and reference answers (a simple stand-in for such a similarity score is sketched after this list).
Detailed Experimental Data: This section presents the text categorization accuracy and the similarity between information extracted by GPT-4 and manually extracted information; a table summarizes the similarity metrics for the various categories.
References: The document lists references for further reading, including sources on large language models and research related to porous organic cages.
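As referenced above, a prompt for the nine-way categorization step could be structured roughly as follows. The wording, example paragraph, and category names below are placeholders; the authoritative prompts are those reproduced in the supplementary document itself.

    # Hypothetical few-shot prompt skeleton for the multi-label categorization
    # step; the real prompts and the full nine-category list are in the
    # supplementary file.
    CLASSIFICATION_PROMPT = """You are reading one paragraph from a paper on porous organic cages.
    Assign every applicable label from this list:
    author information; affiliation; synthesis details; CCDC number; surface area; reference; ...

    Example paragraph: "Crystallographic data for the reported cage have been deposited with the CCDC."
    Example labels: CCDC number

    Paragraph: {paragraph}
    Labels:"""

Likewise, the similarity calculation between GPT-4's answers and the reference answers could be as simple as a normalized string-overlap score. The snippet below uses Python's standard library as a stand-in and is not necessarily the metric implemented in the released scripts.

    from difflib import SequenceMatcher

    def similarity(gpt_answer: str, reference: str) -> float:
        """Return a 0-1 string-similarity score between two extracted values."""
        return SequenceMatcher(None, gpt_answer.strip().lower(),
                               reference.strip().lower()).ratio()

    # Illustrative comparison of a GPT-extracted value against a manual one.
    print(similarity("BET surface area: 1000 m2/g", "1000 m2 g-1 (BET)"))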
Supplementary weblinks
Title
Source code and detailed test results
Description
The linked repository contains the Python scripts used for processing and categorizing the text data, extracting information, and building the chatbot interface.
It also includes the raw data with detailed results for each step: the categories assigned to each text segment in the text categorization step, and a side-by-side comparison of the manual extraction and GPT extraction results from the information extraction step, both provided as Excel tables.
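As a sketch of how such scripts might tie together, the fragment below loads a hypothetical Excel table of extracted cage data and passes matching rows to GPT-4o as context for a user question. The file name, retrieval logic, and prompt are assumptions for illustration, not the released implementation.

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()
    # Hypothetical database produced by the information extraction step.
    db = pd.read_excel("cage_database.xlsx")

    def answer_question(question: str) -> str:
        """Retrieve rows mentioning the query terms and let GPT-4o answer from them."""
        mask = db.apply(lambda row: any(term.lower() in str(row).lower()
                                        for term in question.split()), axis=1)
        context = db[mask].to_string(index=False)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer using only the tabulated cage data provided."},
                {"role": "user", "content": f"Data:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content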