ChemRxiv

DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers

preprint
submitted on 24.04.2021, 09:23 and posted on 29.04.2021, 07:20 by Kohulan Rajan, Achim Zielesny, Christoph Steinbeck

The amount of data available on chemical structures and their properties has increased exponentially over the past decades. In particular, articles published before the mid-1990s are available only in printed or scanned form. Extracting the data from those articles and storing it in a publicly accessible database is desirable, but doing so manually is slow and error-prone. To extract chemical structure depictions and convert them into a computer-readable format, optical chemical structure recognition (OCSR) tools were developed; the best-performing of these tools are mostly rule-based.

The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods and to provide an automated open-source software solution. Various current deep learning approaches were explored to find the best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of predicting SMILES encodings of chemical structure depictions with about 90% accuracy, given a dataset of 50-100 million molecules. In this article, the new DECIMER model is presented: a transformer-based network that predicts SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information, and with above 89% accuracy for depictions with stereochemical information.
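To give a concrete sense of the sequence side of such a pipeline, the sketch below shows a regex-based SMILES tokenizer of the kind commonly used to prepare targets for sequence-prediction models. This is an illustrative, minimal example under common conventions, not code taken from the DECIMER implementation; the pattern and function names are our own.

```python
import re

# Illustrative token pattern (not the DECIMER vocabulary): bracketed atoms,
# two-letter organic-subset atoms, one-letter atoms (lowercase = aromatic),
# bonds/branches/charges, and ring-closure digits.
SMILES_TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]"              # bracket atoms, e.g. [nH], [C@@H]
    r"|Br|Cl"                  # two-letter atoms
    r"|[BCNOPSFIbcnops]"       # one-letter atoms
    r"|[=#$:/\\+().%-]"        # bonds, branches, ring-closure marker, charges
    r"|[0-9]"                  # ring-closure digits
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens suitable for a sequence model."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: the tokens must reconstruct the input exactly,
    # otherwise the string contains characters outside the pattern.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

# Example: alanine; the stereocentre [C@@H] stays a single token.
print(tokenize_smiles("C[C@@H](N)C(=O)O"))
```

Keeping bracket expressions such as `[C@@H]` as single tokens is what lets a model emit stereochemical information as one vocabulary item rather than character by character.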


History

Email Address of Submitting Author

kohulan.rajan@uni-jena.de

Institution

Friedrich-Schiller-University Jena

Country

Germany

ORCID For Submitting Author

0000-0003-1066-7792

Declaration of Conflict of Interest

No Conflict of Interest

Version Notes

Initial version
