Abstract
We present a comparison of two large chemical databases: PubChem Compound (general chemical information) and Aurora Fine Chemicals (AFC; commercially available compounds). Each database contains about 115M records. The comparison shows that about 50% of the structures in these databases are identical, while the rest are unique to one database or the other, which is unexpected for such large samples. PubChem Compound contains many structures that are unstable at room temperature. For a more detailed comparison, the chemical structures were decomposed into circular fragments with a radius of up to 3 chemical bonds. PubChem Compound was found to contain 1.5 times more fragments than AFC, which is explained by the larger average size of a chemical structure in PubChem Compound. Among the 30 most frequently occurring fragments, 28 are shared by both databases. Analysis of the unique fragments revealed classes of structures that are poorly represented in each database: triple-bonded compounds in PubChem Compound and organosilicon compounds in AFC. Our study provides important information for the development of in silico engineering approaches for new polymers based on virtual synthesis from simple chemical fragments.
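The fragment-based comparison can be illustrated with a minimal sketch. It is not the procedure used in this study: it assumes a simplified Morgan-style scheme in which each atom-centred fragment is identified by the nested description of its neighbourhood, and it uses toy heavy-atom graphs in place of real database records. The names `circular_fragments`, `fragment_set`, `db_a` and `db_b` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def circular_fragments(atoms, bonds, max_radius=3):
    """Count atom-centred fragments for radii 0..max_radius (simplified sketch).

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order) tuples with atom indices and bond order
    Fragments are keyed by (radius, identifier).
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    # Radius 0: each atom is described by its element symbol alone.
    ids = {i: (symbol,) for i, symbol in enumerate(atoms)}
    counts = Counter((0, ident) for ident in ids.values())

    for radius in range(1, max_radius + 1):
        new_ids = {}
        for i in neighbours:
            # Combine the atom's previous identifier with the sorted
            # (bond order, neighbour identifier) pairs, so the result does
            # not depend on atom numbering.
            env = sorted((order, ids[j]) for j, order in neighbours[i])
            new_ids[i] = (ids[i], tuple(env))
        ids = new_ids
        counts.update((radius, ident) for ident in ids.values())
    return counts


def fragment_set(database, max_radius=3):
    """Collect the distinct fragments found in a list of (atoms, bonds) records."""
    fragments = set()
    for atoms, bonds in database:
        fragments.update(circular_fragments(atoms, bonds, max_radius))
    return fragments


# Toy records (heavy atoms only); assumptions made for illustration.
ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
acetonitrile = (["C", "C", "N"], [(0, 1, 1), (1, 2, 3)])

db_a = [ethanol, acetonitrile]
db_b = [ethanol]

frags_a = fragment_set(db_a)
frags_b = fragment_set(db_b)
shared = frags_a & frags_b      # fragments present in both collections
unique_a = frags_a - frags_b    # fragments found only in db_a
unique_b = frags_b - frags_a    # fragments found only in db_b
```

In this toy case the fragments unique to `db_a` are those containing the nitrile nitrogen or the triple bond, which mirrors how unique fragments can point to structure classes that one collection lacks. A production workflow would more likely rely on a cheminformatics toolkit with canonical fragment identifiers.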