Comparison of PubChem Compound and Aurora Fine Chemical Large-Scale Chemistry Databases

13 March 2025, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We present a comparison of two large chemical databases: PubChem Compound (general chemical information) and Aurora Fine Chemical (AFC; commercially available compounds). Each database contains about 115M records. The comparison shows that about 50% of the structures are identical in the two databases, while the rest are unique to one or the other, which is unexpected for such large samples. The PubChem Compound database contains many structures that are unstable at room temperature. For a more detailed comparison, the chemical structures are decomposed into circular fragments with a radius of up to 3 chemical bonds. We find that the PubChem Compound database contains 1.5 times more fragments than AFC. This difference is attributed to the larger average size of a chemical structure in PubChem Compound than in AFC. We also find that, of the 30 most frequent fragments, 28 are shared by both databases. Analysis of the unique fragments allowed us to identify classes of structures that are poorly represented in each database: triple-bonded compounds in PubChem Compound and organosilicon compounds in AFC. Our study provides important information for the development of in silico engineering approaches for new polymers based on virtual synthesis from simple chemical fragments.
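The fragment-based comparison described in the abstract can be illustrated with a minimal sketch. It does not reproduce the authors' implementation. It assumes a Morgan-style signature scheme, in which an atom's environment at radius r is built from its own signature and its neighbors' signatures at radius r - 1, and it uses toy heavy-atom graphs in place of real database records. The function names `circular_fragments` and `fragment_set`, the signature format, and the example molecules are all hypothetical.

```python
from collections import Counter

def circular_fragments(atoms, bonds, max_radius=3):
    """Count Morgan-style environment signatures for radii 0..max_radius.

    atoms: list of element symbols; bonds: list of (i, j, bond_order).
    Assumption: this signature scheme stands in for the paper's
    circular-fragment decomposition, which the abstract does not specify.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbors[i].append((j, order))
        neighbors[j].append((i, order))

    # Radius 0: each atom is described by its element symbol alone.
    signatures = {i: atoms[i] for i in neighbors}
    counts = Counter(signatures.values())

    # Each pass extends every environment by one bond.
    for _ in range(max_radius):
        updated = {}
        for i, nbrs in neighbors.items():
            shell = sorted(f"{order}{signatures[j]}" for j, order in nbrs)
            updated[i] = f"{signatures[i]}({','.join(shell)})"
        signatures = updated
        counts.update(signatures.values())
    return counts

def fragment_set(database, max_radius=3):
    """Collect the distinct fragment signatures found in a list of molecules."""
    fragments = set()
    for atoms, bonds in database:
        fragments |= set(circular_fragments(atoms, bonds, max_radius))
    return fragments

# Toy heavy-atom graphs (hypothetical stand-ins for database records).
ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
propyne = (["C", "C", "C"], [(0, 1, 3), (1, 2, 1)])
ethane = (["C", "C"], [(0, 1, 1)])

db_a = [ethanol, propyne]
db_b = [ethanol, ethane]

frags_a = fragment_set(db_a)
frags_b = fragment_set(db_b)
shared = frags_a & frags_b    # fragments present in both toy databases
only_a = frags_a - frags_b    # fragments unique to db_a
only_b = frags_b - frags_a    # fragments unique to db_b
```

In this toy setup the triple-bond environment from propyne appears only in `only_a`, mirroring how unique fragments can point to structure classes that one database lacks. At the scale of 115M records, a production workflow would more likely rely on a cheminformatics toolkit with canonical fragment identifiers than on string signatures like these.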

Keywords

large-scale chemical database
chemical structure
structure decomposition
polymers
