Analysis of Whole Genome Sequences and Homology Modelling of a 3-C Like Peptidase and a Non-Structural Protein of the Novel Coronavirus COVID-19 Shows Protein Ligand Interaction with an Aza-Peptide and a Noncovalent Lead Inhibitor with Possible Antiviral Properties

The family Coronaviridae consists mainly of virulent zoonotic pathogens. Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) and Middle East Respiratory Syndrome coronavirus (MERS-CoV) of this family have emerged before, and now the Novel COVID-19 has emerged in China. Characterization of spike glycoproteins, polyproteins and other viral proteins is important for vaccine development. Homology modelling of these proteins with known templates offers the opportunity to discover ligand binding sites and possible antiviral properties of the resulting protein-ligand complexes, and any information emerging from these protein models can be used for vaccine development. In this study we carried out a complete bioinformatic analysis, sequence alignment, comparison of multiple sequences and homology modelling of the Novel COVID-19 whole genome sequences, the spike protein and the polyproteins to look for homology with known proteins; we also analysed receptor binding sites in these models for possible vaccine development. Our results showed that the tertiary structure of the polyprotein of isolate COVID-19_HKU-SZ-001_2020 had 98.94 percent identity with SARS-Coronavirus NSP12 bound to NSP7 and NSP8 co-factors. Our results also indicate that a part of the viral genome (residues 254 to 13480 in Frame 2, with 4409 amino acids) of the Novel COVID-19 virus isolate Wuhan-Hu-1 (GenBank Accession Number MN908947.3), when modelled with template 2a5i of the PDB database, had 96 percent identity with a 3C-like peptidase of SARS-CoV that can bind Aza-Peptide Epoxide (APE), which is known for irreversible inhibition of the SARS-CoV main peptidase. The same part of the genome, when modelled with template 3e9s of the PDB database, had 82 percent identity with a papain-like protease/deubiquitinase whose ligand GRL0617 acts as an inhibitor that can block SARS-CoV replication. It is possible that these viral inhibitors can be used for vaccine development for the Novel COVID-19.

Introduction
More than a decade has passed since the emergence of the human coronavirus that caused Severe Acute Respiratory Syndrome (SARS-CoV), and it is about 7 years since the emergence of another type of coronavirus, Middle East Respiratory Syndrome (MERS-CoV), and now the Novel [...] Human-to-human transmission of this virus has been a concern and due to this search for [...]
Coronaviruses are RNA viruses with large genomes, and because of this they can have a high error rate in replication compared to host genomes. It is also known that various CoVs can effectively recombine their genomes after infecting host cells (Luo et al 2018). This [...] approach. In the case of SARS-CoV, these proteins mediate binding of the virus to its receptor and promote fusion between the viral and host cell membranes and virus entry into the host cell; hence peptides, antibodies, organic compounds and short interfering RNAs that interact with the spike protein can have a potential role in vaccine development (Du et al 2009).
In this study we carried out a complete bioinformatic analysis, sequence alignment and comparison of multiple sequences of the Novel COVID-19 whole genome sequences, the spike protein and the polyproteins to look for homology with known spike proteins, and also analysed receptor binding sites for possible vaccine development.

Materials and Methods
Six complete viral genome sequences, seven polyproteins (RdRp region) and seven glycoproteins available on the NCBI portal on 4 Feb 2020 were taken for analysis. The sequence details and GenBank accession numbers are listed in Table 1. The available polyprotein (RdRp region) and glycoprotein sequences were retrieved from GenBank, NCBI (Benson et al., 2000). These sequences were translated to amino acid sequences using sorted six-frame translation with BioEdit (Hall et al., 2011). Multiple sequence alignment of the translated protein sequences was performed and a phylogenetic tree was constructed using MEGA X (Kumar et al., 2018). The alignment shows that amongst the seven polyproteins, five sequences were identical, being from the same isolate, and the two other sequences, from the other isolate, were identical to each other. A similar analysis of the seven glycoproteins showed that all seven glycoprotein sequences were identical. Therefore, further analysis was carried out for three sequences. The ExPASy proteomics server (Gasteiger et al., 2003) was used to study the protein sequence and structural details. These peptides were studied for their physico-chemical properties using the tool ProtParam (Gasteiger et al., 2005). The secondary structure analysis was done using the Chou and Fasman algorithm with CFSSP (Kumar, 2013). To generate the 3D structure from the FASTA sequence, homology modelling was performed and the templates were identified.
The model was built using the template with the highest identity. The structural assessment was also performed to validate the model built. Swiss-model (Schwede et al., 2003) [...] Structural information is extracted from the template, and sequence alignment is used to define insertions and deletions.
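The six-frame translation step described above can be illustrated with a minimal sketch. This is not BioEdit's implementation: the function names (`six_frame_translation`, `longest_orf`) are hypothetical, the standard genetic code is assumed, and the short input sequence is a toy example rather than data from this study.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (unknown bases become N)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq))


def translate(seq):
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(nucleotides):
    """Translate three forward (+1..+3) and three reverse (-1..-3) frames."""
    seq = nucleotides.upper().replace("U", "T")  # accept RNA input as well
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch between stop codons in one translated frame."""
    return max(protein.split("*"), key=len)


# Toy input (hypothetical, not a sequence from this study).
frames = six_frame_translation("ATGGCCTAA")
```

Sorting the frames by the length returned from `longest_orf` is one plausible way to rank them; whether this matches the ordering used by the original tool is an assumption.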
The protein-ligand interaction profile (hydrogen bonding, hydrophobic interactions, salt bridges and π-stacking) was generated with the PLIP server (Salentin et al., 2015).

Results and Discussion
The phylogenetic tree of the seven polyproteins is shown in Fig. 2. Two polyproteins were distinctly different from the rest. The tertiary structure analysis of the isolate COVID-19_HKU-SZ-001_2020 ORF1ab polyprotein is given in Table 2. The polyprotein has 98.94 percent identity with PDB structure 6nur.1.A and 19.74 percent identity with an ABC-type uncharacterized transport system periplasmic component-like protein.
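As an illustration of what a percent-identity figure such as those above measures, the following sketch counts identical residues over the aligned, non-gap columns of a pairwise alignment. The function name and the toy sequences are hypothetical, and the choice of denominator (columns without gaps) is an assumption; modelling servers may normalise differently.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gap columns are skipped (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment (hypothetical): 4 matches over 5 gap-free columns.
identity = percent_identity("MKT-AY", "MKSQAY")
```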
The phylogenetic tree of the seven glycoproteins of the Wuhan seafood market pneumonia virus isolates is shown in Fig. 3; the glycoproteins are similar in all the isolates.
The polyprotein is an RNA-directed RNA polymerase. The protein is nearly identical (98.94 percent identity) to the SARS-Coronavirus NSP12 bound to NSP7 and NSP8 co-factors (Kirchdoerfer and Ward 2019). In SARS it is a non-structural protein, with NSP12 being the RNA-dependent RNA polymerase and the co-factors NSP7 and NSP8 forming hexadecameric complexes and also acting as a processivity clamp for RNA polymerase and primase (Fehr et al., 2016). As in the SARS virus, this protein may be involved in the assembly of the coronavirus core RNA-synthesis machinery. This polyprotein can be taken as a template to design antiviral compounds. The polyprotein also has an identity of 19.74 percent with an ABC-type uncharacterized transport system periplasmic component-like protein; this protein is known to be a substrate-binding protein, and possible binding can be explored here (Bae et al 2019).

Multiple alignment of the polyproteins of the Novel Coronavirus COVID-19 is shown in
The primary structure parameters of the 7 polyproteins (RdRp region) of the Wuhan seafood market pneumonia virus isolates are given in Supplementary Table 3. RdRp forms an important part of the viral genome; in RNA viruses its function is to catalyze the synthesis of the RNA strand complementary to a given RNA template. The SI200040-SP and SI200121-SP orf1ab polyproteins had 2 reading frames, whereas the rest of the isolates had 3 reading frames. The presence of multiple reading frames suggests the possibility of overlapping genes, as seen in many viral, prokaryotic and mitochondrial genomes. This could affect how the proteins are made. The number of amino acid residues was the same in all the polyproteins except for isolate SI200040-SP, which had one amino acid more than the other polyproteins. The extinction coefficients of the SI200040-SP and SI200121-SP orf1ab polyproteins were much higher than those of the rest of the polyproteins.
The extinction coefficient is important when studying protein-protein and protein-ligand interactions. The instability index of these two isolates was also high compared to the others, indicating that these two polyproteins are unstable. Regulation of gene expression by polyprotein processing is known in viruses and is seen in many viruses that are human pathogens (Yost et al 2013).
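To make the extinction coefficient comparison concrete, the sketch below estimates the molar extinction coefficient at 280 nm from amino acid composition, using the per-residue values commonly used for this calculation (Trp 5500, Tyr 1490, cystine 125, in M^-1 cm^-1). The function name and the toy peptide are hypothetical, and the assumption that every pair of cysteines forms a cystine is only one of the two cases usually reported.

```python
from collections import Counter

# Molar extinction contributions at 280 nm in water (M^-1 cm^-1).
EXT_TRP = 5500
EXT_TYR = 1490
EXT_CYSTINE = 125


def extinction_coefficient(protein, cystines_formed=True):
    """Estimate the 280 nm extinction coefficient from sequence composition.

    If cystines_formed is True, every pair of Cys residues is assumed to form
    a disulfide bond (an assumption); otherwise all Cys are treated as reduced.
    """
    counts = Counter(protein.upper())
    value = counts["W"] * EXT_TRP + counts["Y"] * EXT_TYR
    if cystines_formed:
        value += (counts["C"] // 2) * EXT_CYSTINE
    return value


# Toy peptide (hypothetical): 1 Trp, 2 Tyr, 2 Cys.
with_cystines = extinction_coefficient("WYYCC")
reduced = extinction_coefficient("WYYCC", cystines_formed=False)
```

Because the estimate depends only on the counts of Trp, Tyr and Cys, a difference in extinction coefficient between two polyproteins of nearly equal length points to a difference in the number of these residues.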
The isolates here, like many other viruses, may be using a replication strategy that involves the translation of a large polyprotein with subsequent cleavage by viral proteases.
The SI200040-SP and SI200121-SP orf1ab polyproteins also showed shorter half-lives than the other isolates, indicating that they are susceptible to enzymatic degradation. [...] The model with template 3e9s of the PDB database shows that the coronavirus viral protein can bind a ligand that inhibits the papain-like protease (PLpro) and is known to be a potent inhibitor of viral replication in SARS (Ratia et al 2008).
The two parts of the main protein from the whole genome of the Novel Coronavirus COVID-19 aligned with two SARS proteins, and the ligand binding sites were similar; the alignment positions, number of amino acids, ligands and interacting residues are given in Table 3. The main protein of the Wuhan coronavirus, with a sequence length of 5509 aa, showing structural alignment with two other proteins of SARS-CoV is given in Table 4. [...] Table 1; the hydrophobic interactions, hydrogen bonding and salt bridges of the template 2a5i are given in Suppl. Table 2. Comparing both, the binding properties are the same except for the presence of a water bridge in the template 2a5i.
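The comparison of binding properties between a constructed model and its template can be sketched as a set comparison per interaction type. The function name, the dictionary layout and the residue labels (`RES1` to `RES5`) are placeholders invented for illustration; they are not residues or output from this study or from the PLIP server.

```python
def compare_interaction_profiles(model, template):
    """Compare two interaction profiles given as {interaction_type: [residue labels]}."""
    report = {}
    for kind in sorted(set(model) | set(template)):
        model_residues = set(model.get(kind, ()))
        template_residues = set(template.get(kind, ()))
        report[kind] = {
            "shared": sorted(model_residues & template_residues),
            "model_only": sorted(model_residues - template_residues),
            "template_only": sorted(template_residues - model_residues),
        }
    return report


# Placeholder profiles (hypothetical labels, not real binding-site residues).
model_profile = {
    "hydrophobic": ["RES1"],
    "hydrogen_bond": ["RES2", "RES3"],
    "salt_bridge": ["RES4"],
}
template_profile = {
    "hydrophobic": ["RES1"],
    "hydrogen_bond": ["RES2", "RES3"],
    "salt_bridge": ["RES4"],
    "water_bridge": ["RES5"],  # present only in the template in this toy case
}
report = compare_interaction_profiles(model_profile, template_profile)
```

In this toy case every interaction type is shared except the water bridge, which appears under `template_only`, mirroring the kind of difference described in the text.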
The hydrophobic interactions, hydrogen bonding and π-stacking of the constructed model of the Novel Coronavirus protein (region 1568-1882 aa) with the ligand, a small-molecule noncovalent lead inhibitor, are given in Suppl. Table 3; the hydrophobic interactions, hydrogen bonding and π-stacking of the template 3e9s are given in Suppl. Table 4. Comparing both, the binding properties are the same except for an additional π-stacking at Tyr in the template 2a5i. This shows that there is a high possibility of these antiviral compounds binding to the regions of the Novel Coronavirus protein that are homologous with the SARS protein. [...] and is known to inhibit the papain-like protease that is present in SARS-CoV. This protease is a potential target for antiviral compounds (Chaudhuri et al., 2011). We found that the Novel COVID-19 has homology with this protease and that the binding sites for this compound in the structural protein of the Novel COVID-19 are the same (Table 4). This compound inhibits the enzyme that is