Predicting SARS-CoV-2 Protein Interactions: Insights from Machine Learning

29 September 2023, Version 1
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

We investigated protein-protein interaction (PPI) prediction, focusing on interactions involving the SARS-CoV-2 virus. Our approach combined data preprocessing, feature engineering, classical machine learning models, deep learning, and data visualization. Data were obtained from the IntAct database, a repository of protein interaction information. During preprocessing we removed unwanted identifiers and used regular expressions to extract and retain only the numeric values needed by the predictive models. In the feature engineering step, we transformed the "Confidence Value" variable extracted from IntAct into a structured feature set using one-hot encoding. We then followed two modelling paths: traditional machine learning and deep learning. A Random Forest Classifier predicted protein interactions with commendable accuracy, and Support Vector Machine (SVM) classifiers further supported the robustness of the approach. For deep learning, we built a neural network with TensorFlow and Keras that performed well at discerning protein interactions.
The deep learning model revealed patterns in the PPI data. We present the results through a set of visualizations: heatmaps of confusion matrices show the strengths and weaknesses of the predictive models, while ROC curves, precision-recall curves, and bar plots illustrate the balance between true positives, false positives, and other metrics. We also considered the ethical dimensions of data integration, including data privacy and the responsible handling of biological data. In conclusion, this work illustrates how computational approaches can help decipher protein interactions; it provides predictive models and a blueprint for ethical data management and interpretation. We invite fellow researchers to build on it to improve understanding of the molecular details of SARS-CoV-2 and of host-virus interactions.
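
The preprocessing and feature engineering steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the IntAct "Confidence Value" field holds strings such as "intact-miscore:0.56", that the last number in the string is the score of interest, and that the score is binned before one-hot encoding. The sample rows, the bin edges, and the names extract_score and one_hot are all hypothetical.

```python
import re

# Assumed shape of IntAct "Confidence Value" strings; these rows are made up.
RAW_CONFIDENCE = [
    "intact-miscore:0.35",
    "intact-miscore:0.56",
    "author score:high|intact-miscore:0.74",
]

# Matches an integer or decimal number such as "1" or "0.56".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_score(raw):
    """Return the last numeric value found in the string, or None."""
    matches = NUMBER.findall(raw)
    return float(matches[-1]) if matches else None


# Hypothetical bin edges, chosen only for illustration.
# The upper edge of the last bin is above 1.0 so a score of 1.0 is included.
BINS = [("low", 0.0, 0.45), ("medium", 0.45, 0.6), ("high", 0.6, 1.0001)]


def one_hot(score):
    """One-hot encode a score according to the bin it falls into."""
    return [1 if lower <= score < upper else 0 for _, lower, upper in BINS]


scores = [extract_score(raw) for raw in RAW_CONFIDENCE]
features = [one_hot(score) for score in scores]

for raw, score, vector in zip(RAW_CONFIDENCE, scores, features):
    print(raw, score, vector)
```

Each resulting vector has one column per bin and could be passed, together with other features, to a classifier such as a Random Forest or an SVM. The binning itself is an assumption of this sketch and is not stated in the abstract.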

Keywords

Protein-Protein Interaction (PPI) Prediction
Machine Learning Models
Deep Learning Approaches
SARS-CoV-2 Interactions
Data Preprocessing
Feature Engineering
Random Forest Classifier
Support Vector Machine (SVM)
ROC Curve Analysis
Confusion Matrix
Precision-Recall Curve
Biological Data Integration
Research on SARS-CoV-2
Computational Biology
Bioinformatics
Preprint Archives
Data Visualization
Research Software
Peer Review
Data Privacy and Security

Supplementary materials

Title: SARS-COV Protein dataset
Description: This dataset was used in the research to predict interactions between human and viral proteins using the m1 score and UniProt IDs.

Title: Negatome dataset - Contains the negative protein interaction data for COVID-19
Description: This dataset contains a list of negative interactions compiled from various sources. It was also used to build the main dataset for this research.
