Homology models of Coronavirus 3CL pro Protease

. A regional outbreak of pneumonia in Wuhan, Hubei Province of China in late 2019 was associated with a novel coronavirus. Rapid release of genomic data for the isolated virus enabled the construction of first-generation homology models of the new CoV 3CL pro cysteine protease. Whilst the overall viral genome was most closely associated with bat coronaviruses, the main protease is most closely related (96% identity) to SARS CoV protease.

Proteases are an important class of drug target, 6 most successfully in the cases of HIV protease 7 and Hepatitis C 8 . Recent human outbreaks of related coronaviruses SARS 9,10 in 2002 and MERS 11 in 2012 have focused the world's attention on the global risks of such emerging viruses. Coronaviruses are positive strand, enveloped RNA viruses with exceptionally large genomes which usually encode two or three viral proteases. In the case of Severe Acute Respiratory Syndrome coronavirus, the two proteases are a papain-like cysteine protease (PL pro ) 12 and a chymotrypsin-like cysteine protease 3CL pro , 13 also known as the Main protease. SARS-CoV 3CL pro is essential for viral replication and is thus recognized as a potential drug target for the treatment of SARS infections. To date there are no clinically used inhibitors of the SARS protease yet they remain in development. 14 Figure 1. X-ray crystal structure of SARS-CoV 3CL pro bound to a covalent inhibitor (PDB code 2a5i). Panel A: Domain I shown as yellow ribbons, domain II cyan ribbons, domain III pale pink ribbons. Catalytic dyad His41, Cys145 shown as yellow sticks, covalent azapeptide inhibitor shown as grey sticks. Panel B: Active site at the interface of domains I, II with P1 Gln residue shown hydrogen bonding to S1 subsite residues Phe140, His163, and Glu166 (blue lines).
The active site of SARS-CoV 3CL pro is located in the cleft between domains I and II of the protein, and the two domains each contribute one residue to the catalytic dyad composed of His41 and Cys145 (Figure 1). Substrates and inhibitors typically contain Gln residues at the P1 position and large hydrophobic residues Phe/Leu/Met at P2. 15 The wealth of medicinal chemistry and biochemical information gained over the last 18 years, coupled with good-high sequence similarity between the members of the coronavirus family makes the homology modelling of new and emerging viruses an inviting prospect.

Results and Discussion.
The release of the viral genome of the novel Wuhan coronavirus prompted us to construct first-generation homology models of the new CoV 3CL pro . Initial studies and BLAST analysis indicated that the virus was most closely related to a clade of bat coronaviruses (Figure 2), of which the closest neighbour for which there was protease structural information available in the Protein Data Bank (www.rcsb.org) was bat coronavirus HKU4. 16,17 EMBOSS Needle 18 alignment of the Wuhan genome with the sequence extracted from PDB:2YNB identified a putative 306ss viral 3CL pro protein with 49% sequence identity (65% sequence similarity) ( Figure 3).  An initial Swissmodel 19 search for possible homology model templates against this search sequence however provided only SARS protease templates with much higher sequence identity and similarity (96, 99% respectively). This was confirmed via a second EMBOSS Needle alignment (Figure 3), and visually by a mapping of differing residues onto the respective crystal structures (Figure 4). Alignment of the SARS 2z9j and HKU4 2ynb protease crystal structures however showed that the two had a very similar overall fold with an RMSD of 0.66Å across 229 well conserved Ca atoms. Accordingly, due to their higher overall sequence identities, SARS crystal structures were chosen for the future development of Wuhan CoV protease homology models. Coronavirus crystal structures are predominantly homodimers, and as a consequence Swissmodel template searches returned some duplicates arising from both chains A&B being selected. Models were built as homodimers where appropriate however for visualisation and initial refinement purposes the monomers were used (Table 1, Figure 5A). Pairs 1&3, 4&5 were functionally identical so only 6 models were taken through for further refinement. The 5 Swissmodel dimer models 1,2,4,6,7 were refined as both dimers and monomers, using Schrodinger Suite 2019 and monomer model 8 was refined without reconstruction to a dimer. The models were overlaid as their monomers and compared by manual inspection. The models were very structurally similar, especially in the active site region near His41 and Cys145 ( Figure 5B).  These models were also analysed using 100ns molecular dynamics simulations (see Supporting Information) to check the systems for overall structural stability ( Figure 6). In general the dimeric forms were slightly more stable overall, as the C-and N-termini of the chains form part of the dimer interface and are held more closely than in the monomeric state where they are more mobile. Finally an analysis of the new Wuhan coronavirus genome revealed multiple sites, analogous to those found or proposed for SARS 14 that may be cleaved by the 3CLpro after (L/F/M)-Gln residues. These are mostly identical to the SARS sequences (Table 2).

Methods.
The Wuhan coronavirus genome was obtained from Genbank/NCBI (release MN908947.1). 4 Sequences from crystal structure sequences were obtained from the Protein Data Bank (rcsb.org). Sequence analysis was carried out using the web interface at the European Bioinformatics Institute (EMBL-EBI). 18 Pairwise sequence alignments were performed using the EMBOSS-Needle method with default EBLOSUM62 matrices. Multiple sequence alignments were carried out using Clustal Omega using default HMM profile-profile parameters. Blast searches were carried out against the full UniProt Knowledgebase using blastp 2.9.0+ and default parameters.
Homology models were prepared using the publicly accessible online Swissmodel [19][20][21][22][23] programs, searching the SWISS-MODEL template library (SMTL version 2020-01-02, and PDB release 2019-12-27), or using a directed user-defined template approach. Coronavirus crystal structures are predominantly homodimers and were modelled as such, but the further refinement was initially performed on the monomeric forms. Newer releases of the Swissmodel software enable the inclusion of bound inhibitors when appropriate. In this case, models derived from PDB:2a5i included a bound covalent inhibitor. The initial heavy atomonly models were further refined using the Protein Preparation module in Schrödinger Suite 2019-2 24 to add hydrogen atoms and optimise the internal hydrogen bonding network. Finally, the models were energy minimised using the OPLS3e force field with charges from the force field and implicit water solvent, and the Polak-Ribier Conjugate Gradient (PRCG) method to gradient <0.05Å. The final minimised models were then analysed using MolProbity 25

Conclusion.
The main protease encoded by the Wuhan fish market coronavirus is very highly homologous to SARS CoV 3CL pro , whereas the virus overall is more highly related to bat coronaviruses. Homology models of the Wuhan coronavirus protease were built from known SARS 3CL pro crystal structures in both ligand-bound and apo forms. These represent potential starting points for structure-based drug design and for suggesting point mutations to inform future biochemical experiments.

Supporting Information
EMBOSS and Clustal Omega sequence alignments, MolProbity statistics and Ramachandran plots for refined monomer and dimer models. PDB files for raw Swissmodel and individual refined monomer and dimer models. Molecular Dynamics simulations of monomer and dimeric models.