Homology Models of Coronavirus 2019-nCoV 3CLpro Protease

A regional outbreak of pneumonia in Wuhan, Hubei Province of China in late 2019 was associated with a novel coronavirus. Rapid release of genomic data for the isolated virus enabled the construction of first-generation homology models of the new CoV 3CLpro cysteine protease. Whilst the overall viral genome was most closely associated with bat coronaviruses, the main protease is most closely related (96% identity) to SARS CoV protease.

A regional outbreak of pneumonia in Wuhan, Hubei Province of China in late 2019 was associated with a novel coronavirus. Rapid release of genomic data for the isolated virus enabled the construction of first-generation homology models of the new CoV 3CL pro cysteine protease. Whilst the overall viral genome was most closely associated with bat coronaviruses, the main protease is most closely related (96% identity) to SARS CoV protease.
The active site of SARS-CoV 3CL pro is located in the cleft between domains I and II of the protein, and the two domains each contribute one residue to the catalytic dyad composed of His41 and Cys145 (Figure 1). Substrates and inhibitors typically contain Gln residues at the P1 position and large hydrophobic residues Phe/Leu/Met at P2. 15 The wealth of medicinal chemistry and biochemical information gained over the last 18 years, coupled with good-high sequence similarity between the members of the coronavirus family makes the homology modelling of new and emerging viruses an inviting prospect.

Results and Discussion.
The release of the viral genome of the novel Wuhan coronavirus prompted us to construct first-generation homology models of the new CoV 3CL pro . Initial studies and BLAST analysis indicated that the virus was most closely related to a clade of bat coronaviruses (Figure 2), of which the closest neighbour for which there was protease structural information available in the Protein Data Bank (www.rcsb.org) was bat coronavirus HKU4. 16,17 EMBOSS Needle 18 alignment of the Wuhan genome with the sequence extracted from PDB:2YNB identified a putative 306ss viral 3CL pro protein with 49% sequence identity (65% sequence similarity) ( Figure 3). An initial Swissmodel 19 search for possible homology model templates against this search sequence however provided only SARS protease templates with much higher sequence identity and similarity (96, 99% respectively). This was confirmed via a second EMBOSS Needle alignment (Figure 3), and visually by a mapping of differing residues onto the respective crystal structures (Figure 4). . Visualisation of sequence differences between bat HKU4, SARS and Wuhan coronaviruses. Panel A: X-ray crystal structure (PDB:2z9j) of SARS CoV protease shown as green ribbons, with catalytic His41 and Cys145 residues shown as cyan sticks. Residues which differ in the Wuhan sequence are marked in magenta. Panel B: X-ray crystal structure (PDB:2ynb) of bat coronavirus HKU4 protease shown as green ribbons, with residues which differ in the Wuhan sequence are marked in magenta.
Alignment of the SARS 2z9j and HKU4 2ynb protease crystal structures however showed that the two had a very similar overall fold with an RMSD of 0.66Å across 229 well conserved Ca atoms. Accordingly, due to their higher overall sequence identities, SARS crystal structures were chosen for the future development of Wuhan CoV protease homology models. Coronavirus crystal structures are predominantly homodimers, and as a consequence Swissmodel template searches returned some duplicates arising from both chains A&B being selected. Models were built as homodimers where appropriate however for visualisation and initial refinement purposes the monomers were used (Table 1, Figure 5A). Pairs 1&3, 4&5 were functionally identical so only 6 models were taken through for further refinement. The 5 Swissmodel dimer models 1,2,4,6,7 were refined as both dimers and monomers, using Schrodinger Suite 2019 and monomer model 8 was refined without reconstruction to a dimer. The models were overlaid as their monomers and compared by manual inspection. The models were very structurally similar, especially in the active site region near His41 and Cys145 ( Figure 5B). These models were also analysed using 100ns molecular dynamics simulations (see Supporting Information) to check the systems for overall structural stability ( Figure 6). In general the dimeric forms were slightly more stable overall, as the C-and N-termini of the chains form part of the dimer interface and are held more closely than in the monomeric state where they are more mobile. Finally an analysis of the new Wuhan coronavirus genome revealed multiple sites, analogous to those found or proposed for SARS 14 that may be cleaved by the 3CLpro after (L/F/M)-Gln residues. These are mostly identical to the SARS sequences (

Methods.
The Wuhan coronavirus genome was obtained from Genbank/NCBI (release MN908947.1). 4 Sequences from crystal structure sequences were obtained from the Protein Data Bank (rcsb.org). Sequence analysis was carried out using the web interface at the European Bioinformatics Institute (EMBL-EBI). 18 Pairwise sequence alignments were performed using the EMBOSS-Needle method with default EBLOSUM62 matrices. Multiple sequence alignments were carried out using Clustal Omega using default HMM profile-profile parameters. Blast searches were carried out against the full UniProt Knowledgebase using blastp 2.9.0+ and default parameters.
Homology models were prepared using the publicly accessible online Swissmodel [19][20][21][22][23] programs, searching the SWISS-MODEL template library (SMTL version 2020-01-02, and PDB release 2019-12-27), or using a directed user-defined template approach. Coronavirus crystal structures are predominantly homodimers and were modelled as such, but the further refinement was initially performed on the monomeric forms. Newer releases of the Swissmodel software enable the inclusion of bound inhibitors when appropriate. In this case, models derived from PDB:2a5i included a bound covalent inhibitor. The initial heavy atomonly models were further refined using the Protein Preparation module in Schrödinger Suite 2019-2 24 to add hydrogen atoms and optimise the internal hydrogen bonding network. Finally, the models were energy minimised using the OPLS3e force field with charges from the force field and implicit water solvent, and the Polak-Ribier Conjugate Gradient (PRCG) method to gradient <0.05Å. The final minimised models were then analysed using MolProbity 25

Conclusion.
The main protease encoded by the Wuhan fish market coronavirus is very highly homologous to SARS CoV 3CL pro , whereas the virus overall is more highly related to bat coronaviruses. Homology models of the Wuhan coronavirus protease were built from known SARS 3CL pro crystal structures in both ligand-bound and apo forms. These represent potential starting points for structure-based drug design and for suggesting point mutations to inform future biochemical experiments.

Supporting Information
EMBOSS and Clustal Omega sequence alignments, MolProbity statistics and Ramachandran plots for refined monomer and dimer models. PDB files for raw Swissmodel and individual refined monomer and dimer models. Molecular Dynamics simulations of monomer and dimeric models.     The OPLS2005 force field parameters were implemented during simulations. Each monomer, dimer or ternary complex was placed in an orthorhombic box with solvent buffers up to 10Å along each dimension. The TIP3P solvent model was used to describe water molecules.
Overall neutralisation of the system was achieved by adding sodium and chloride ions at the physiological concentration of 0.15M. The prepared model systems were then relaxed by using the multi-step default protocol in Desmond. After relaxation, the whole system was subjected to a 100ns simulation. The cut-off distance for computing short-range electrostatics and Lennard-Jones interaction was set to 9.0Å. The trajectories and energies were recorded every 100 and 1.2ps, respectively. MD simulations were performed at the IMB cluster from the University of Queensland on NVIDIA P100 GPUs or on a CentOS Linux Desktop with a Quadro RTX 5000 GPU. Simulation analysis was performed on Linux or Macintosh desktops using the Simulation Interactions Diagrams module in Schrödinger Suite 2019-2.

Flexible residues
Figure S1-15. WH CoV 3CL pro Model 2 shown as green ribbons, except catalytic dyad (His41, Cys145 -yellow sticks), and magenta ribbons for regions found to be mobile in 100 ns MD simulation.
Model6 from SARS 2a5i WH_human_protease_from2a5i_Minimised.pdb       Apart from the more stable N-and C-termini for both subunits which form part of the dimer interface the remainder of the dimer remains of similar overall flexibility. One major difference is that in the dimeric form the loop 44 CTSEDMLNPN 53 which forms the boundary of the S2-subsite which binds bulky substrate residues Met/Phe/Leu is much less mobile than was seen in the 100ns simulation of the monomer.  Gln residue shown hydrogen bonding to S1 subsite residues Phe140, His163, and Glu166 (blue lines).
The active site of SARS-CoV 3CL pro is located in the cleft between domains I and II of the protein, and the two domains each contribute one residue to the catalytic dyad composed of His41 and Cys145 (Figure 1). Substrates and inhibitors typically contain Gln residues at the P1 position and large hydrophobic residues Phe/Leu/Met at P2. 15 The wealth of medicinal chemistry and biochemical information gained over the last 18 years, coupled with good-high sequence similarity between the members of the coronavirus family makes the homology modelling of new and emerging viruses an inviting prospect.

Results and Discussion.
The release of the viral genome of the novel Wuhan coronavirus prompted us to construct first-generation homology models of the new CoV 3CL pro . Initial studies and BLAST analysis indicated that the virus was most closely related to a clade of bat coronaviruses (Figure 2)    Alignment of the SARS 2z9j and HKU4 2ynb protease crystal structures however showed that the two had a very similar overall fold with an RMSD of 0.66Å across 229 well conserved C atoms. Accordingly, due to their higher overall sequence identities, SARS crystal structures were chosen for the future development of Wuhan CoV protease homology models.
Coronavirus crystal structures are predominantly homodimers, and as a consequence Swissmodel template searches returned some duplicates arising from both chains A&B being selected. Models were built as homodimers where appropriate however for visualisation and initial refinement purposes the monomers were used (Table 1, Figure 5A). Pairs 1&3, 4&5 were functionally identical so only 6 models were taken through for further refinement. The 5 Swissmodel dimer models 1,2,4,6,7 were refined as both dimers and monomers, using Schrodinger Suite 2019 and monomer model 8 was refined without reconstruction to a dimer.
The models were overlaid as their monomers and compared by manual inspection. The models were very structurally similar, especially in the active site region near His41 and Cys145 ( Figure 5B).  These models were also analysed using 100ns molecular dynamics simulations (see Supporting Information) to check the systems for overall structural stability ( Figure 6). In general the dimeric forms were slightly more stable overall, as the Cand N-termini of the chains form part of the dimer interface and are held more closely than in the monomeric state where they are more mobile. Finally an analysis of the new Wuhan coronavirus genome revealed multiple sites, analogous to those found or proposed for SARS 14 that may be cleaved by the 3CLpro after (L/F/M)-Gln residues. These are mostly identical to the SARS sequences (   (Table adapted  The Wuhan coronavirus genome was obtained from Genbank/NCBI (release MN908947.1). 4 Sequences from crystal structure sequences were obtained from the Protein Data Bank (rcsb.org). Sequence analysis was carried out using the web interface at the European Bioinformatics Institute (EMBL-EBI). 18 Pairwise sequence alignments were performed using the EMBOSS-Needle method with default EBLOSUM62 matrices. Multiple sequence alignments were carried out using Clustal Omega using default HMM profile-profile parameters. Blast searches were carried out against the full UniProt Knowledgebase using blastp 2.9.0+ and default parameters.
Homology models were prepared using the publicly accessible online Swissmodel [19][20][21][22][23] programs The main protease encoded by the Wuhan fish market coronavirus is very highly homologous to SARS CoV 3CL pro , whereas the virus overall is more highly related to bat coronaviruses.
Homology models of the Wuhan coronavirus protease were built from known SARS 3CL pro crystal structures in both ligand-bound and apo forms. These represent potential starting points for structure-based drug design and for suggesting point mutations to inform future biochemical experiments.   Overall neutralisation of the system was achieved by adding sodium and chloride ions at the physiological concentration of 0.15M. The prepared model systems were then relaxed by using the multi-step default protocol in Desmond. After relaxation, the whole system was subjected to a 100ns simulation. The cut-off distance for computing short-range electrostatics and Lennard-Jones interaction was set to 9.0Å. The trajectories and energies were recorded every 100 and 1.2ps, respectively. MD simulations were performed at the IMB cluster from the University of Queensland on NVIDIA P100 GPUs or on a CentOS Linux Desktop with a Quadro RTX 5000 GPU. Simulation analysis was performed on Linux or Macintosh desktops using the Simulation Interactions Diagrams module in Schrödinger Suite 2019-2.

Flexible residues
Figure S1-15. WH CoV 3CL pro Model 2 shown as green ribbons, except catalytic dyad (His41, Cys145 -yellow sticks), and magenta ribbons for regions found to be mobile in 100 ns MD simulation.
Over the first 10 ns of the simulation the mobile loop 44 CTSEDMLNPN 53 defining the boundary of the S2 subsite moves away from the bound ligand, and remains distant during the remaining 80ns.

Flexible residues
Figure S1-19. WH CoV 3CL pro Model 6 shown as green ribbons, except catalytic dyad (His41, Cys145 -yellow sticks), and magenta ribbons for regions found to be mobile in the 100 ns molecular dynamics simulation. Covalently bound SARS inhibitor carried through in model from template PDB:2a5i shown as grey sticks.

Flexible residues
Figure S1-21. WH CoV 3CL pro Model 12 Chain A shown as green ribbons, Chain B cyan ribbons, except catalytic dyads (His41, Cys145 -green and cyan sticks), and magenta ribbons for regions found to be mobile in 100 ns MD simulation.
In comparison to the MD simulation of the monomer (Model 2), the dimer shows stable Nand C-termini for both subunits, as these form part of the dimer interface which remains quite stable during the simulation. The very mobile helix and loop region show slight differences in mobility (Chain A Ser46-Ser62, Chain B Thr45-Asp56). In addition there are differences in the N-terminal domains, the Chain B helices Leu227-Met235 and Glu270-Met276 being slightly more mobile than those in Chain A.

Flexible residues
Figure S1-14. WH CoV 3CL pro Model 16 Chain A shown as green ribbons, Chain B cyan ribbons, except catalytic dyads (His41, Cys145 -yellow sticks), and magenta ribbons for regions found to be mobile in 100 ns MD simulation. SARS 3CL pro inhibitor shown as grey sticks.