Identification of Enzymatic Active Sites with Unsupervised Language Modeling

Loïc Kwate Dassi; Matteo Manica; Daniel Probst; Philippe Schwaller; Yves Gaetan Nana Teukam; Teodoro Laino

doi:10.26434/chemrxiv-2021-m20gg-v2

Theoretical and Computational Chemistry

07 January 2022, Version 2

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

The first decade of genome sequencing saw a surge in the characterization of proteins with unknown functionality. Even so, more than 20% of proteins in well-studied model animals have yet to be identified, making the discovery of their active sites one of biology's greatest puzzles. Herein, we apply a Transformer architecture to a language representation of bio-catalyzed chemical reactions to learn the signal underlying the atomic interactions between substrate and active site. The language representation comprises a reaction simplified molecular-input line-entry system (SMILES) string for the substrate and products, complemented with amino acid (AA) sequence information for the enzyme. We demonstrate that, by creating a custom tokenizer and a score based on attention values, we can capture the substrate-active site interaction signal and use it to determine the active site position in unknown protein sequences, unraveling complicated 3D interactions using just 1D representations. With no supervision, this approach recovers 31.51% of the active site when co-crystallized substrate-enzyme structures are taken as ground truth, vastly outperforming approaches based on sequence similarity alone. Our findings are further corroborated by docking simulations on the 3D structures of a few enzymes. This work confirms the impact of natural language processing, and more specifically of the Transformer architecture, on domain-specific languages, paving the way to effective solutions for protein functional characterization and bio-catalysis engineering.
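As an illustration only, the idea of scoring residues from attention values can be sketched as follows. This is not the authors' implementation: the token layout (substrate tokens first, residue tokens after), the summation rule, and the names `residue_scores`, `predict_active_site`, and `toy_attention` are all assumptions made for this example. The abstract does not specify how the score is computed.

```python
from typing import List, Sequence


def residue_scores(attention: Sequence[Sequence[float]],
                   n_substrate_tokens: int) -> List[float]:
    """Sum, for each residue token, the attention it receives from substrate tokens.

    Assumption: rows and columns of `attention` index the same token sequence,
    with the first `n_substrate_tokens` positions holding substrate (SMILES)
    tokens and the remaining positions holding amino acid tokens.
    """
    n_tokens = len(attention)
    scores = [0.0] * (n_tokens - n_substrate_tokens)
    for row in attention[:n_substrate_tokens]:  # substrate tokens as queries
        for j in range(n_substrate_tokens, n_tokens):  # residue tokens as keys
            scores[j - n_substrate_tokens] += row[j]
    return scores


def predict_active_site(attention: Sequence[Sequence[float]],
                        n_substrate_tokens: int,
                        top_k: int) -> List[int]:
    """Return the 0-based positions of the top_k highest-scoring residues."""
    scores = residue_scores(attention, n_substrate_tokens)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical 6x6 attention map: 2 substrate tokens followed by 4 residues.
# Only the two substrate rows matter for the score; residue rows are uniform.
toy_attention = [
    [0.10, 0.10, 0.05, 0.40, 0.05, 0.30],
    [0.10, 0.10, 0.10, 0.35, 0.05, 0.30],
] + [[1 / 6] * 6 for _ in range(4)]

print(predict_active_site(toy_attention, n_substrate_tokens=2, top_k=2))
```

In this toy case the second and fourth residues receive the most attention from the substrate tokens, so they are the predicted active-site positions. A real pipeline would take the attention map from a trained model and would need to decide which layers and heads to aggregate, which this sketch leaves open.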

Keywords

Deep Learning Applications

Chemical Language Modeling

Green Chemistry

Molecular Transformer

RXN

Active Sites

Enzymatic Reactions

Protein Language Modeling

Interpretability

Supplementary weblinks

GitHub repository: Implementation of tokenizers as well as model training and inference
