Abstract
Distinguishing natural products (NPs) from synthetic molecules (SMs) with machine learning reveals the features that differentiate the two classes and so provides an impetus for research in natural product-based drug design. Natural products generally show greater chemical diversity and biochemical specificity than synthetic molecules, among other properties, which makes them favorable lead structures for drug discovery. Here, we propose a machine-learning approach that uses the PaDEL descriptor software to develop a classification method that differentiates NPs from SMs on the basis of a variety of molecular features. Several supervised learning algorithms, including Logistic Regression, Naive Bayes, Random Forests, and Decision Trees, were tested to determine which molecular descriptors are most important and which algorithm achieves the highest accuracy. The best-performing method in this paper, Random Forests, reached an accuracy of 89.19%, comparable with previous models performing the same classification. Identifying and classifying the distinguishing properties of natural products and synthetic compounds allows for a better understanding of available chemical data and better incorporation of such properties in small-molecule drug discovery.
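The workflow summarized above can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: it assumes scikit-learn as the modeling library, uses randomly generated placeholder data in place of real PaDEL descriptor values, and the descriptor names, sample sizes, and hyperparameters are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): in practice X would hold PaDEL descriptor
# values, one row per molecule, and y the class labels (1 = NP, 0 = SM).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
descriptor_names = [f"descriptor_{i}" for i in range(X.shape[1])]  # hypothetical

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The four algorithm families named in the abstract; settings are illustrative.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {accuracies[name]:.3f}")

# Rank descriptors by the random forest's impurity-based importance.
forest = models["Random Forest"]
ranked = sorted(
    zip(descriptor_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_descriptors = ranked[:5]
for name, importance in top_descriptors:
    print(f"{name}: importance = {importance:.3f}")
```

Accuracies from this placeholder data say nothing about the reported 89.19%; the sketch only shows the shape of the comparison and how a descriptor ranking can be read off a fitted random forest.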
Supplementary materials
Title
Supporting Information: Key Molecular Descriptors Distinguishing Between Synthetic and Natural Products
Description
Supporting information, including dataset parameters, raw code, and supplementary figures.
Supplementary weblinks
Title
GitHub link to raw code
Description
This contains the raw code used for the analyses presented in the paper.