Daniel Probst, IBM Research Europe
Enzyme catalysts are an integral part of green chemistry strategies aimed at more sustainable and resource-efficient chemical synthesis. However, applying enzymes to unreported substrates and anticipating their specific stereo- and regioselectivity relies on domain-specific knowledge that takes decades of field experience to master. This makes the retrosynthesis of given targets with biocatalysed reactions a significant challenge. Here, we use the molecular transformer architecture to capture the latent knowledge about enzymatic activity from a large data set of publicly available biochemical reactions, extending forward reaction and retrosynthetic pathway prediction to the domain of biocatalysis. We introduce a class token based on the EC classification scheme that allows the model to capture catalysis patterns shared among enzymes belonging to the same hierarchical families. The forward prediction model achieves a top-1 accuracy of 49.6% and a top-5 accuracy of 62.7%, while the single-step retrosynthetic model shows a top-1 round-trip accuracy of 39.6% and a top-10 round-trip accuracy of 42.6%. Trained models and curated data are made publicly available with the hope of promoting enzymatic catalysis and making green chemistry more accessible through the use of digital technologies.
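To illustrate the idea of an EC-based class token, the following minimal sketch shows one way a substrate SMILES string could be combined with a truncated EC number before being passed to a sequence model. The token format (`[ec1:3]`), the `|` separator, and the function names are hypothetical choices made for this example and are not taken from the abstract; the regular expression is the SMILES tokenization pattern commonly used with molecular transformer models. Truncating the EC number to fewer levels is what would let enzymes from the same hierarchical family share tokens.

```python
import re

# SMILES tokenization pattern commonly used with molecular transformer models.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens and check that nothing was lost."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES: {smiles!r}")
    return tokens


def ec_tokens(ec_number: str, levels: int = 3) -> list[str]:
    """Turn an EC number such as '3.1.1.1' into one token per hierarchy level.

    Only the first `levels` levels are kept (hypothetical token format).
    """
    parts = ec_number.split(".")[:levels]
    return [f"[ec{i}:{part}]" for i, part in enumerate(parts, start=1)]


def build_model_input(substrates: str, ec_number: str, levels: int = 3) -> str:
    """Join substrate tokens and EC class tokens into one space-separated string.

    The '|' separator is an assumption made for this sketch.
    """
    tokens = tokenize_smiles(substrates) + ["|"] + ec_tokens(ec_number, levels)
    return " ".join(tokens)


if __name__ == "__main__":
    # Acetic acid paired with an EC number cut to two levels.
    print(build_model_input("CC(=O)O", "3.1.1.1", levels=2))
```

With `levels=2`, any two enzymes whose EC numbers start with `3.1` yield identical class tokens, which is the kind of sharing across a hierarchical family that the class token is meant to exploit.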