ChemRxiv
These are preliminary reports that have not been peer-reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in news media as established information. For more information, please see our FAQs.
Supporting_Information.pdf (1.03 MB)

Unassisted Noise-Reduction of Chemical Reactions Data Sets

preprint
submitted on 29.05.2020 and posted on 01.06.2020 by Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens, Teodoro Laino

Existing deep learning models applied to reaction prediction in organic chemistry can reach extremely high levels of accuracy (> 90% for NLP-based ones [1]). Because these models embed no chemical knowledge other than the information learnt from reaction data, the quality of the data sets plays a crucial role in their performance. Since human curation is prohibitively expensive, unaided approaches that remove chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries (noise) from chemical reaction collections. Results show that models trained on cleaned and balanced data sets improve the quality of the predictions without a decrease in performance. For the retrosynthetic models, the round-trip accuracy is enhanced by 13% and the value of the cumulative Jensen-Shannon metric is lowered to 70% of its original value, while maintaining high values of coverage (97%) and constant class-diversity (1.6) at inference.
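The abstract reports a cumulative Jensen-Shannon metric, but this page does not define the cumulative variant. As a hedged illustration only, the sketch below computes the standard Jensen-Shannon divergence between two discrete probability distributions, on which such a metric is presumably built. The function names and the choice of base-2 logarithms are assumptions made for this example, not the authors' implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence between two discrete distributions.

    Illustrative only: this is NOT the cumulative metric from the preprint.
    With base-2 logarithms the result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution m = (p + q) / 2; m_i > 0 wherever p_i or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical class-probability distributions, chosen for illustration.
    uniform = [0.25, 0.25, 0.25, 0.25]
    skewed = [0.70, 0.10, 0.10, 0.10]
    print(jensen_shannon_divergence(uniform, skewed))
```

A lower value means the two distributions are closer, which is consistent with the abstract treating a reduction of the metric as an improvement.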

Email Address of Submitting Author

ato@zurich.ibm.com

Institution

IBM Research Zurich

Country

Switzerland

ORCID For Submitting Author

https://orcid.org/0000-0002-5218-8653

Declaration of Conflict of Interest

none