Structural and Functional Annotation of Hypothetical Protein AVO28_00330 of Yersinia pestis: An In Silico Approach

Yersinia pestis is an infamous gram-negative coccobacillus of the family Enterobacteriaceae responsible for three devastating plague pandemics worldwide. Recent outbreaks of this zoonotic disease call for in silico study of its hypothetical proteins to support drug and vaccine discovery. As hypothetical proteins constitute a substantial portion of the proteome, it is essential to annotate them structurally and functionally. The current study characterized the physicochemical properties, predicted a homology-based 3D structure and annotated the functions of the hypothetical protein AVO28_00330 of Y. pestis using a range of bioinformatic tools and software. The Swiss Model and Phyre2 servers were used to predict the tertiary model, which was then energy-minimized using the YASARA server. The quality assessment servers rated the model as good. For future molecular docking analysis, active binding sites were predicted using CASTp. Protein-protein interaction analysis was performed with the STRING server. InterPro, Pfam, Motif and other tools were used for functional prediction. The hypothetical protein was found to contain a tetratricopeptide repeat domain and a rubredoxin metal-binding domain, which regulate the lipopolysaccharide metabolic process in the outer cell membrane and thereby contribute to the virulence associated with the protein. This in silico analysis therefore improves the current understanding of the protein and will aid future work on therapeutic drug and vaccine development.


Introduction
Yersinia pestis, the etiologic agent of plague and a member of the family Enterobacteriaceae, is a gram-negative, non-spore-forming, non-motile coccobacillus that grows within a temperature range of 4 to 40 °C and an optimum pH range of 7.2 to 7.6 [1]. This infamous bacterium is responsible for three devastating pandemics in history, namely the Justinian plague, the Black Death and the Modern plague [2]. Plague is a zoonosis: it spreads from rodents, the natural reservoir, to humans with fleas as the vector [3]. The pathogen is transmitted to humans by flea bites during a blood meal, by direct contact with a mucous membrane or damaged skin, or by inhalation of aerosolized droplets [4]. If left untreated, the bubonic form causes death within a week, and the septicemic and pneumonic forms kill even faster. The rapid development of a bacterial biofilm inside the digestive tract of the flea helps Y. pestis adapt to a unique life stage for effective transmission [5]. Because of its high genomic similarity to Y. pseudotuberculosis, Y. pestis is thought to be a recently emerged clone of that species [6], [7]. The USA, the former Soviet Union and Japan developed Y. pestis as a biological weapon during the 20th century [2].
In the 21st century, plague has been reported from Asia, Africa and America, as the pathogenic strain is endemic in animal populations, and recent outbreaks in Uganda, the Democratic Republic of Congo, China and Madagascar indicate a major health concern [6]. Although plague has a long disease history, no highly effective vaccine conferring long-lasting protection has yet been developed. Moreover, the recent emergence of antibiotic-resistant strains poses a serious threat to global public health and biodefense [6], [8], [9]. Together, these aspects have stimulated biotechnological interest in studying Y. pestis with an integrated in silico approach for new drug synthesis and vaccine development.
Hypothetical proteins are predicted but experimentally uncharacterized proteins, and they constitute a substantial portion of the proteome of both eukaryotes and prokaryotes [10]. With the remarkable advances in Next Generation Sequencing (NGS), the number of hypothetical proteins is increasing rapidly, while the rate of experimental validation lags behind. This gap in structural and functional annotation can be narrowed through in silico approaches using modern bioinformatic tools, which might pave the way for new drug synthesis and vaccine development. The current study therefore focuses on annotating the hypothetical protein AVO28_00330 of Y. pestis both structurally and functionally to improve understanding of the protein, which might later support drug and vaccine development.

Sequence retrieval and similarity identification
The amino acid sequence of AVO28_00330 was retrieved in FASTA format from the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) under GenBank accession ID KZC74892.1. A similarity search was first performed with the NCBI Blastp program [11] against the non-redundant and UniProtKB/SwissProt [12] databases to predict the function of the hypothetical protein.

Multiple sequence alignment and phylogeny analysis
Multiple sequence alignment (MSA) between the hypothetical protein and similar proteins obtained from Blastp was performed using the MUSCLE algorithm in MEGA 10 [13], [14]. The MSA was cross-checked with the Clustal Omega program of EMBL-EBI [15]. Phylogenetic analysis was then performed by submitting the NEXUS file generated by MEGA to Phylogeny.fr [16].
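As an illustration of what the alignment step yields, the following minimal Python sketch computes pairwise percent identity from already-aligned sequences. It is not the MUSCLE or Phylogeny.fr procedure itself; the function name `percent_identity` and the short toy sequences are hypothetical and stand in for the real alignment exported from MEGA.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment (NOT the real AVO28_00330 alignment).
alignment = {
    "query": "MLKT-AVLG",
    "homolog_1": "MLKTQAVLG",
    "homolog_2": "MIRT-AALG",
}

for (name1, seq1), (name2, seq2) in combinations(alignment.items(), 2):
    print(f"{name1} vs {name2}: {percent_identity(seq1, seq2):.1f}% identity")
```

Pairwise identities of this kind are only a rough proxy for the distances a tree-building method uses, but they help to sanity-check which homologs should cluster closest to the query.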

Physicochemical characterization
Physical and chemical properties including molecular weight, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, grand average of hydropathicity (GRAVY), isoelectric point, total number of negatively charged residues (Asp + Glu) and total number of positively charged residues (Arg + Lys) were predicted using the ProtParam tool (http://web.expasy.org/protparam/) of ExPASy [17].
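Several of these properties follow directly from the amino acid composition. The sketch below is a minimal pure-Python illustration, not the ProtParam implementation: it computes GRAVY as the mean Kyte-Doolittle hydropathy, the aliphatic index as X(Ala) + 2.9·X(Val) + 3.9·(X(Ile) + X(Leu)) with X in mole percent, and the charged-residue counts. The function names and the toy sequence are hypothetical; the real AVO28_00330 sequence would be substituted.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq: str) -> float:
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def charged_counts(seq: str) -> tuple[int, int]:
    """Return (Asp + Glu, Arg + Lys) residue counts."""
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


# Hypothetical toy sequence (NOT the real AVO28_00330 sequence).
toy = "MLKTAVLGDERLLKA"
print(f"GRAVY: {gravy(toy):.3f}")
print(f"Aliphatic index: {aliphatic_index(toy):.2f}")
print(f"(Asp+Glu, Arg+Lys): {charged_counts(toy)}")
```

A negative GRAVY value points to an overall hydrophilic protein, and a higher aliphatic index is usually read as greater thermostability.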

Tertiary structure modeling, visualization and quality assessment
The tertiary structure was modeled using the Swiss Model [17] and Phyre2 [30] servers. For higher accuracy, the best-scoring template was selected for homology modeling. The 3D model was visualized using UCSF Chimera [31]. The quality of the obtained models was assessed with the PROCHECK [32], Verify3D [33] and ERRAT [34] servers. Finally, energy minimization was performed on the best predicted model using the YASARA energy minimization server [35].

Active site detection
The active sites were determined using the Computed Atlas of Surface Topography of proteins (CASTp) server, an online resource for locating, delineating and measuring concave surface regions on three-dimensional structures of proteins [36].

Protein-protein interaction analysis
STRING 11.5 [45] server was utilized to predict the possible protein-protein functional interaction network.

Submission of the model to protein model database
The suitable model generated for hypothetical protein AVO28_00330 of Y. pestis was successfully submitted to Protein Model Database (PMDB) [46].

Sequence similarity and phylogeny analysis
The Blastp results against the non-redundant and SwissProt databases showed homology with lipopolysaccharide assembly proteins (Table 1). Multiple sequence alignment between the hypothetical protein and the homologous proteins was used to generate a NEXUS file in MEGA. To strengthen the homology assessment between proteins, down to the complex and subunit level, phylogenetic analysis was performed. The phylogenetic tree showed the distances between branches and revealed that the hypothetical protein is closely related to the Y. pestis homolog WP_046596310.1 and distantly related to NLU15143.1 of Serratia liquefaciens (Fig. 1).

Physicochemical characterization
The protein AVO28_00330 was predicted to contain 389 amino acids, of which Leu (51) was the most abundant and Trp (5) the least abundant (Table 2). The molecular weight was calculated as 44336.69 Da and the theoretical pI was 5.94, indicating that the protein is acidic and carries a net negative charge at neutral pH. The total numbers of positively charged residues (Arg + Lys) and negatively charged residues (Asp + Glu) were 46 and 54, respectively. The instability index was 41.32; as this exceeds the threshold of 40, the protein is classified as unstable [47]. The aliphatic index was 88.12, which suggests that the protein is stable over a wide temperature range. The GRAVY was -0.366; the negative value indicates that the protein is hydrophilic and is therefore likely to interact well with water [48]. The high extinction coefficient (47370) reflects the presence of Cys, Trp and Tyr residues [49]. The N-terminal residue of the sequence was taken to be M (Met). Protein half-life is an estimate of the time required for the concentration of the radiolabeled target protein to fall to 50 percent of its amount at the onset of the chase [50]. The estimated half-life was 30 h in mammalian reticulocytes (in vitro), >20 h in yeast (in vivo) and >10 h in Escherichia coli (in vivo). The total number of atoms and the molecular formula were 6189 and C1942H3082N562O579S24, respectively.
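The reported values can be cross-checked against each other with simple arithmetic. The following Python sketch parses the molecular formula, sums the atoms, estimates the molecular weight from rounded standard average atomic masses (an assumption of this sketch, so small deviations from the ProtParam value are expected) and takes the difference of the charged-residue counts as a crude net-charge indicator that ignores His and the termini. The helper name `parse_formula` is hypothetical.

```python
import re

# Rounded standard average atomic masses (assumption of this sketch).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def parse_formula(formula: str) -> dict[str, int]:
    """Parse a formula such as 'C1942H3082' into {'C': 1942, 'H': 3082}.

    Assumes every element symbol is followed by an explicit count.
    """
    return {element: int(count)
            for element, count in re.findall(r"([A-Z][a-z]?)(\d+)", formula)}


counts = parse_formula("C1942H3082N562O579S24")
total_atoms = sum(counts.values())
approx_mass = sum(ATOMIC_MASS[element] * n for element, n in counts.items())

# Crude net-charge indicator: (Arg + Lys) minus (Asp + Glu).
net_charge = 46 - 54

print(f"Total atoms: {total_atoms}")
print(f"Approximate molecular weight: {approx_mass:.2f} Da")
print(f"Charged-residue balance: {net_charge}")
```

The atom total reproduces the reported 6189, the approximate mass lies within about 1 Da of the reported 44336.69 Da, and the excess of acidic over basic residues is consistent with a theoretical pI below 7.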

Subcellular localization
Subcellular localization prediction helps to characterize a protein as a potential drug or vaccine candidate. Cytoplasmic proteins can be selected as potential drug targets, whereas both inner and outer membrane proteins can act as potential vaccine targets [51]. CELLO 2.5 predicted the protein to be localized in the cytoplasm, and the result was validated by PSORTb, PSLpred and SOSUIGramN (Table 3). Protein-Sol predicted the protein to be soluble. Prediction of a signal peptide is important for understanding the transport system and cleavage sites of the hypothetical protein. Signal peptide was detected by both PrediSi and SignalP-5.0. However, no transmembrane helices were detected using HMMTOP, TMHMM and SABLE, which further supports a cytoplasmic localization of the protein.

Secondary structure prediction
The SOPMA server was used first with default parameters. SOPMA predicted alpha-helix as the dominant element (68.12%), followed by random coil (25.71%), beta turn (3.86%) and extended strand (2.31%) (Table 4). PSIPRED predicted a similar result with high confidence (Fig. 2-3). Knowledge of the secondary structure helps to understand the function of the protein, as a strong correlation exists between protein structure and function.

Tertiary structure modeling, visualization and quality assessment
A homology modeling approach was taken to determine the tertiary structure of the hypothetical protein. The Swiss Model server predicted the 3D structure (Fig. 4) based on the most favored template, 4zlh.1.B (PDB ID: 4ZLH_B). 4ZLH is the crystal structure of the cytoplasmic domain of Escherichia coli lipopolysaccharide assembly protein B (LapB). The template is a homodimer with two chains (chain A and chain B), and chain B was used by the Swiss Model server to build the model. For this template, the Global Model Quality Estimation (GMQE), Quaternary Structure Quality Estimation (QSQE) and sequence identity were 0.83, 0.51 and 76.04, respectively. The quality of the model was assessed by PROCHECK through Ramachandran plot analysis, which shows the distribution of the φ and ψ angles of the model within the allowed limits (Fig. 5, Table 5). Residues in the most favored regions accounted for 93.1%, which indicates good quality and validity of the model. Verify3D showed that 94.69% of the residues had an averaged 3D-1D score ≥ 0.2 (Fig. 6), which indicates a good environmental profile for the predicted model. The overall quality factor predicted by the ERRAT server was 98.752, which validates the model as a good one.
Similarly, the Phyre2 server predicted the 3D model with 100% confidence and 87% coverage: 337 of 389 residues were modeled with 100% confidence by the single highest-scoring template, c4zlhB (PDB ID: 4ZLH_B). PROCHECK placed 91.0% of the residues in the most favored regions, which indicates good confidence for the predicted model (Table 5). According to Verify3D, 87.33% of the residues had an averaged 3D-1D score ≥ 0.2, which validates the predicted model. The ERRAT quality factor was 89.6024, which is suggestive of a valid model.
Considering the Ramachandran plot analysis, the Verify3D results and the ERRAT results, the tertiary structure modeled by Swiss Model was preferable to the model predicted by the Phyre2 server.
Therefore, energy minimization was performed on the Swiss Model 3D structure using the YASARA server, and the resulting scene file (.sce) was visualized in YASARA. The energy was -338180.6 kJ/mol before minimization and decreased to -431656.0 kJ/mol afterwards, a drop of 93475.4 kJ/mol that indicates a more stable model.
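The comparison between the two models can be made explicit with a small tally. The Python sketch below uses only the scores reported above and counts how many of the three criteria each model wins; this win count is an illustrative device assumed here, not a scoring scheme used by the servers. It also recomputes the Phyre2 coverage from the residue counts. The names `metrics` and `tally_wins` are hypothetical.

```python
# Quality scores reported in the text (all in percent or quality-factor units).
metrics = {
    "Swiss Model": {"ramachandran_favored": 93.1, "verify3d": 94.69, "errat": 98.752},
    "Phyre2": {"ramachandran_favored": 91.0, "verify3d": 87.33, "errat": 89.6024},
}


def tally_wins(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Count, per model, the criteria on which it has the highest score."""
    wins = {model: 0 for model in scores}
    criteria = next(iter(scores.values())).keys()
    for criterion in criteria:
        best = max(scores, key=lambda model: scores[model][criterion])
        wins[best] += 1
    return wins


wins = tally_wins(metrics)
preferred = max(wins, key=wins.get)

# Phyre2 coverage: residues modeled at 100% confidence over total length.
coverage = 100.0 * 337 / 389

print(f"Wins per model: {wins}")
print(f"Preferred model: {preferred}")
print(f"Phyre2 coverage: {coverage:.1f}%")
```

For all three criteria a higher value is better, so a simple maximum per criterion suffices; criteria with a different direction would need to be handled separately.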

Active site determination
CASTp 3.0 predicted 52 amino acids to be involved in the potential active sites (Fig. 7). The best active site had an area of 968.799 Å² and a volume of 3258.076 Å³ (Fig. 8).
ScanProsite and Motif also showed the presence of a tetratricopeptide repeat domain and a rubredoxin metal-binding domain. Rubredoxin-type domains form small non-heme iron-binding sites in which four cysteine residues coordinate a single metal ion in a tetrahedral environment. The main feature of this domain is its extended loops, or knuckles. The rubredoxin domain associates intimately with the tetratricopeptide motifs, and this association is essential for lipopolysaccharide regulation and for the growth of bacterial cells [52]. Lipopolysaccharide (LPS) in the outer membrane of the cell wall contributes significantly to the pathogenicity of Y. pestis: it gives the bacterium a unique ability to overcome the defense mechanisms of both mammalian and insect hosts, as well as antibiotics, by using lipid A as an anchor that keeps the LPS bound to the membrane while its carbohydrate chain is oriented towards the environment [4]. Based on amino acid composition analysis, VirulentPred classified the protein as virulent. PFP-FunDSeqE predicted a globin-like folding pattern. The InterPro server assigned the hypothetical protein to the TPR (tetratricopeptide repeat)-like superfamily. Together, these results support a role of the protein in the metabolic process of lipopolysaccharides, a group of related, structurally complex components of the outer membrane of gram-negative bacteria.

Protein-protein interaction analysis
Protein-protein interactions (PPI) play a crucial role in the basic processes of living cells. PPI data can provide deep insight into the molecular machinery and thus a better understanding of disease mechanisms [53]. The STRING 11.5 server was used to search for possible functional partners of the hypothetical protein in the PPI network. The identified functional partners, with their scores, were lapA (0.973), cutC (0.634), pgpB (0.616), asmB (0.573), rlpB (0.548), hemX (0.546), lpxH (0.536), ftsH (0.527), YPO3362 (0.520) and yfiO (0.515). Of these, YPO3362 is an essential cell division protein, yfiO is part of the outer membrane protein assembly complex, ftsH is a processive protease involved in the quality control of integral membrane proteins, lpxH is a lipid A biosynthesis enzyme, hemX is a methyltransferase, rlpB is a lipopolysaccharide assembly protein, asmB is involved in lipid A biosynthesis, pgpB is a phosphatidylglycerophosphatase B-like protein, cutC is involved in the control of copper homeostasis and lapA is involved in the assembly of lipopolysaccharide (Fig. 9).
Fig. 9. STRING network protein-protein interaction analysis showing the functional partners of lapB.
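To make the relative strength of these predicted partners easier to see, the following Python sketch ranks the reported scores and assigns each to a confidence band. The cutoffs (0.4 medium, 0.7 high, 0.9 highest) are the commonly quoted STRING thresholds and are stated here as an assumption of the sketch; the function name `confidence_band` is hypothetical.

```python
# Combined scores of the functional partners as reported in the text.
partners = {
    "lapA": 0.973, "cutC": 0.634, "pgpB": 0.616, "asmB": 0.573,
    "rlpB": 0.548, "hemX": 0.546, "lpxH": 0.536, "ftsH": 0.527,
    "YPO3362": 0.520, "yfiO": 0.515,
}


def confidence_band(score: float) -> str:
    """Map a combined score to a band (cutoffs are an assumption of this sketch)."""
    if score >= 0.9:
        return "highest"
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


# Rank partners from the strongest to the weakest predicted association.
ranked = sorted(partners.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:8s} {score:.3f} {confidence_band(score)}")

medium = [name for name, score in ranked if confidence_band(score) == "medium"]
```

Under these cutoffs only lapA falls in the highest band, while the remaining nine partners are medium-confidence associations that would merit more cautious interpretation.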

Submission to Protein Model Database (PMDB)
The predicted 3D model of the hypothetical protein AVO28_00330 of Yersinia pestis was successfully submitted to the PMDB database. The model can be found by searching for PMDB ID PM0084191.

Conclusion
The current study aimed to create the first 3D structure and to propose probable functions of the Yersinia pestis hypothetical protein AVO28_00330. The model was submitted as a new record to the Protein Model Database. The analysis points to an essential role of the protein, which has a predicted globin-like folding pattern, in the regulation of the lipopolysaccharide metabolic process of the bacterial cell. The predicted active binding sites of the homology-modeled protein should be helpful for further investigation of therapeutic drug design against the protein using molecular docking. The physicochemical, structural and functional annotation provides a better understanding of the protein's activity. This methodology should also be helpful in the structural and functional elucidation of other uncharacterized proteins. Finally, in vitro experiments should be conducted to validate the predictions presented here and to establish the protein's role in biotechnology.