Abstract
Porous organic cages (POCs) are an emerging subclass of porous materials that are attracting increasing attention for their structural tunability, modularity and processability, and research in this area is expanding rapidly. Nevertheless, extracting sufficient information from the extensive literature on organic molecular cages is a time-consuming and labour-intensive process. This article presents a GPT-4-based literature-reading workflow that combines multi-label text classification with follow-up information extraction, allowing the potential of GPT-4 to be fully exploited for rapidly extracting valid information from the literature. In the multi-label classification step, the prompt-engineered GPT-4 labelled text with satisfactory recall according to the type of information it contained, including authors, affiliations, synthetic procedures, surface areas, and the CCDC numbers of the corresponding cages. GPT-4 also proved proficient at information extraction, effectively transforming the labelled text into concise tabulated data. Finally, we built a chatbot on top of the resulting database, enabling quick and comprehensive searches across the entire database and answers to cage-related questions.
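To illustrate the classification step described in the abstract, the following minimal Python sketch asks GPT-4 to assign multi-label categories to a single paragraph. The model name, category list, and prompt wording here are illustrative assumptions rather than the exact script used in this work; the released code is linked under the supplementary materials below.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Illustrative label set; the paper's full nine-category scheme is given
    # in the supplementary prompt-engineering section.
    CATEGORIES = ["author", "affiliation", "synthetic procedure",
                  "surface area", "CCDC number", "reference", "other"]

    def classify_paragraph(paragraph: str) -> list[str]:
        """Ask GPT-4 which categories apply to one paragraph (multi-label)."""
        prompt = (
            "Label the paragraph below with every category that applies, "
            f"returned as a comma-separated list: {', '.join(CATEGORIES)}.\n\n"
            f"Paragraph:\n{paragraph}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = response.choices[0].message.content
        # Simple parse of the comma-separated reply back into known labels.
        return [label.strip() for label in answer.split(",")
                if label.strip() in CATEGORIES]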
Supplementary materials
Title
Prompts, source code and detailed test results
Description
The document is organized into several sections:
General Information: It mentions the use of GPT-4o for constructing interactive chatbots and provides a URL for the raw data and source code.
Prompt Engineering: This section details the prompts used to categorize text into nine categories, including author information, synthesis details, CCDC numbers, surface area, affiliations, and references. It also provides example paragraphs with their corresponding categories (a hypothetical prompt skeleton is sketched after this list).
Python Code: This section includes Python code for paragraph segmentation, text categorization, data extraction, chatbot creation, and calculation of the similarity between GPT-4's answers and reference answers (a simple stand-in for such a similarity score is sketched after this list).
Detailed Experimental Data: This section presents the text categorization accuracy and the similarity between information extracted by GPT-4 and manually extracted information; a table summarizes the similarity metrics for the various categories.
References: The document lists references for further reading, including sources on large language models and research related to porous organic cages.
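As referenced above, a prompt for the nine-way categorization step could be structured roughly as follows. The wording, example paragraph, and category names below are placeholders; the authoritative prompts are those reproduced in the supplementary document itself.

    # Hypothetical few-shot prompt skeleton for the multi-label categorization
    # step; the real prompts and the full nine-category list are in the
    # supplementary file.
    CLASSIFICATION_PROMPT = """You are reading one paragraph from a paper on porous organic cages.
    Assign every applicable label from this list:
    author information; affiliation; synthesis details; CCDC number; surface area; reference; ...

    Example paragraph: "Crystallographic data for the reported cage have been deposited with the CCDC."
    Example labels: CCDC number

    Paragraph: {paragraph}
    Labels:"""

Likewise, the similarity calculation between GPT-4's answers and the reference answers could be as simple as a normalized string-overlap score. The snippet below uses Python's standard library as a stand-in and is not necessarily the metric implemented in the released scripts.

    from difflib import SequenceMatcher

    def similarity(gpt_answer: str, reference: str) -> float:
        """Return a 0-1 string-similarity score between two extracted values."""
        return SequenceMatcher(None, gpt_answer.strip().lower(),
                               reference.strip().lower()).ratio()

    # Illustrative comparison of a GPT-extracted value against a manual one.
    print(similarity("BET surface area: 1000 m2/g", "1000 m2 g-1 (BET)"))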
Supplementary weblinks
Title
Source code and detailed test results
Description
The linked repository contains the Python scripts used for processing and categorizing the text data, extracting information, and building the chatbot interface.
It also includes the raw data with detailed results for each step: the categories assigned to each text segment in the text categorization step, and a side-by-side comparison of the manual extraction and GPT extraction results from the information extraction step, both provided as Excel tables.
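As a sketch of how such scripts might tie together, the fragment below loads a hypothetical Excel table of extracted cage data and passes matching rows to GPT-4o as context for a user question. The file name, retrieval logic, and prompt are assumptions for illustration, not the released implementation.

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()
    # Hypothetical database produced by the information extraction step.
    db = pd.read_excel("cage_database.xlsx")

    def answer_question(question: str) -> str:
        """Retrieve rows mentioning the query terms and let GPT-4o answer from them."""
        mask = db.apply(lambda row: any(term.lower() in str(row).lower()
                                        for term in question.split()), axis=1)
        context = db[mask].to_string(index=False)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer using only the tabulated cage data provided."},
                {"role": "user", "content": f"Data:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content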