Development of Machine Learning Models to Predict a Chemical’s Anti-SARS-CoV-2 Activities

04 May 2022, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The COVID-19 pandemic, caused by the coronavirus SARS-CoV-2, has put global health systems at risk, creating an urgent need for effective treatments for infection by this coronavirus. To accelerate the identification of novel drug candidates for COVID-19 treatment in the drug discovery process, we report a series of ML-based models to predict the anti-SARS-CoV-2 activities of screening compounds. These models were trained and evaluated using the experimental data deposited in the COVID-19 OpenData Portal, which is hosted by NCATS (https://ncats.nih.gov/expertise/covid19-open-data-portal). We explored 6 popular ML algorithms in combination with 15 molecular descriptors to represent the molecular structures from 9 screening assays. Of note, 6 of these screening assay datasets were also used by KC et al. to construct the prediction models deployed in the REDIAL-2020 model suite (Nature Machine Intelligence, 3, 527–535, 2021). We investigated the impact of the ML algorithms and molecular descriptors on model performance. The model constructed using the k-nearest neighbors (KNN) method and the hybrid molecular descriptor, GAFF+RDKit, achieved the best performance. We evaluated model performance on 28 drugs that have been used in clinical trials for treating COVID-19. The overall performance of our models was better than that of REDIAL-2020. For the external CPE dataset, 79% of compounds were correctly predicted by our model, considerably better than REDIAL-2020 (66.7%). For the external 3CL assay, the percentage of correct predictions by our predictors (38.1%) was lower than that of REDIAL-2020 (61.9%). However, our models achieved more accurate predictions for the 100 druglike compounds selected as a negative control. Furthermore, we constructed another 3CL model using the screening data from the study by Kuzikov et al. This classification model achieved the best performance on the prediction of the positive control, although its performance was lower than that of REDIAL-2020 on the prediction of the negative control. A web server (https://clickff.org/amberweb/covid-19-cp) was developed to enable users to predict the anti-SARS-CoV-2 activities of arbitrary compounds. The web portal provides users with a fast way to identify potential compound candidates for COVID-19 treatment, which can substantially reduce the time and cost of experiments on anti-SARS-CoV-2 activity.
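
To make the KNN idea mentioned in the abstract concrete, the following is a minimal sketch of a k-nearest neighbors activity classifier, written under several assumptions. It is not the authors' pipeline: the three descriptor columns (molecular weight, logP, hydrogen-bond donors), the toy values, the labels, the z-score scaling, and the function names `standardize`, `scale_one`, `knn_predict`, and `fraction_correct` are all hypothetical stand-ins for the GAFF+RDKit descriptors and assay data used in the study.

```python
import math
from collections import Counter


def standardize(rows):
    """Z-score each descriptor column; return scaled rows, column means, column stds."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Population standard deviation; fall back to 1.0 for a constant column.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def scale_one(row, means, stds):
    """Apply the training-set scaling to a single query compound."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training compounds closest to the query."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def fraction_correct(predicted, observed):
    """Share of compounds whose predicted label matches the observed label."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)


# Hypothetical toy descriptors: [molecular weight, logP, H-bond donors].
# Labels are illustrative only: 1 = active, 0 = inactive.
train_descriptors = [
    [320.0, 3.1, 1], [335.0, 3.4, 2], [310.0, 2.9, 1],
    [150.0, 0.5, 3], [165.0, 0.2, 4], [140.0, 0.8, 3],
]
train_labels = [1, 1, 1, 0, 0, 0]

train_scaled, means, stds = standardize(train_descriptors)

queries = [[325.0, 3.0, 1], [155.0, 0.4, 3]]
predictions = [
    knn_predict(train_scaled, train_labels, scale_one(q, means, stds))
    for q in queries
]
print(predictions)
```

The `fraction_correct` helper mirrors the kind of "percentage of correct predictions" figure quoted for the external datasets; the real study reports additional performance metrics in its Supplemental Information.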

Keywords

Covid-19
anti-SARS-CoV
machine learning

Supplementary materials

Supplemental Information
The Supplemental Information includes the description of machine learning algorithms, performance metrics, Figures S1-S2, and Tables S1-S7.
