Theoretical and Computational Chemistry

Development of Machine Learning Models to Predict a Chemical’s Anti-SARS-CoV-2 Activities

Authors

Abstract

The COVID-19 pandemic, caused by the coronavirus SARS-CoV-2, has put global health systems at risk, creating an urgent need for effective treatments for infection by this coronavirus. To accelerate the identification of novel drug candidates for COVID-19 treatment in the drug discovery process, we report a series of machine learning (ML) models that accurately predict the anti-SARS-CoV-2 activities of screening compounds. The models were trained and evaluated using experimental data deposited in the COVID-19 OpenData Portal, which is hosted by NCATS (https://ncats.nih.gov/expertise/covid19-open-data-portal). We explored 6 popular ML algorithms in combination with 15 molecular descriptors for molecular structures from 9 screening assays. Of note, the datasets from 6 of these screening assays were also used by KC et al. to construct the prediction models deployed in the REDIAL-2020 model suite (Nature Machine Intelligence, 3, 527–535, 2021). We investigated the impact of ML algorithms and molecular descriptors on model performance. The model constructed using the k-nearest neighbors (KNN) method and the hybrid molecular descriptor GAFF+RDKit achieved the best performance. We also evaluated model performance on 28 drugs that have been used in clinical trials for treating COVID-19. Overall, our models performed better than REDIAL-2020. For the external CPE dataset, 79% of compounds were correctly predicted by our model, significantly better than REDIAL-2020 (66.7%). For the external 3CL assay, the percentage of correct predictions by our predictors (38.1%) was lower than that of REDIAL-2020 (61.9%). However, our models achieved more accurate predictions for the 100 druglike compounds selected as negative controls. Furthermore, we reconstructed another 3CL model using the screening data from the study by Kuzikov et al. This classification model achieved the best performance on the prediction of positive controls, although its performance on the negative controls was lower than that of REDIAL-2020. A web server (https://clickff.org/amberweb/covid-19-cp) was developed to enable users to predict the anti-SARS-CoV-2 activities of arbitrary compounds. The web portal provides users a fast and reliable way to identify potential compound candidates for COVID-19 treatment, which can greatly reduce the time and cost of experiments on anti-SARS-CoV-2 activity.
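
As a rough illustration of the k-nearest neighbors classification and the "percentage of correct predictions" metric mentioned in the abstract, the following is a minimal sketch only. It is not the authors' implementation: the two-dimensional descriptor vectors, the labels, and the function names knn_predict and fraction_correct are hypothetical, and the real models use GAFF+RDKit descriptors that are not reproduced here.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Rank training samples by Euclidean distance to the query vector.
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def fraction_correct(predicted, actual):
    """Fraction of predictions that match the reference labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy descriptor vectors (assumption: NOT data from the study).
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
           (0.90, 0.80), (0.80, 0.90), (0.85, 0.75)]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

# Hypothetical external compounds with known labels.
test_X = [(0.12, 0.18), (0.88, 0.82)]
test_y = ["inactive", "active"]

predictions = [knn_predict(train_X, train_y, x) for x in test_X]
accuracy = fraction_correct(predictions, test_y)
print(predictions, accuracy)
```

In practice, the descriptor vectors would be computed from each compound's molecular structure, and the choice of k and distance metric would be tuned on held-out data.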


Supplementary material

Supplemental Information
The Supplemental Information includes descriptions of the machine learning algorithms and performance metrics, Figures S1-S2, and Tables S1-S7.

Supplementary weblinks

COVID-19-CP web portal
The web portal provides users a fast and reliable way to identify potential compound candidates for COVID-19 treatment, which can greatly reduce the time and cost of experiments on anti-SARS-CoV-2 activity.