Can Organic Chemistry Literature Enable Machine Learning Yield Prediction?

25 March 2022, Version 1
This content is a preprint and has not undergone peer review at the time of posting.


Synthetic yield prediction using machine learning is intensively studied. Previous work has focused on an ideal use case, high-throughput experimentation datasets, whereas predicting yields from literature data remains elusive. We built a large literature-based dataset of more than a thousand reactions, focusing on the activation of carbon-oxygen bonds of phenol derivatives under nickel catalysis. Detailed reaction conditions and associated yields were manually curated and stored in an open-access database. We assessed the performance of state-of-the-art machine learning models on this dataset and explored their ability to make predictions on novel publications, coupling partners and substrates. Our work shows that machine learning can have practical applications on well-designed yield prediction tasks, and it provides a unique public database for further improving these methods on literature chemical data.
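Evaluating a model on "novel publications, coupling partners and substrates" implies holding out whole groups of reactions rather than random rows. The following is a minimal sketch of such a grouped split, not the authors' code: the record fields (`doi`, `coupling_partner`, `yield`), the helper name `split_by_group`, and the toy values are all hypothetical and are not taken from the dataset.

```python
import random
from collections import defaultdict


def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group value appears in both train and test.

    `group_key` names the field to hold out on (a publication identifier,
    a coupling partner, a substrate, ...). Illustrative helper only.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle the group labels (not the rows) with a fixed seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Hold out at least one whole group for testing.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Made-up toy records, for illustration only.
reactions = [
    {"doi": "pub-A", "coupling_partner": "boronic acid", "yield": 82.0},
    {"doi": "pub-A", "coupling_partner": "boronic acid", "yield": 45.0},
    {"doi": "pub-B", "coupling_partner": "amine", "yield": 67.0},
    {"doi": "pub-B", "coupling_partner": "amine", "yield": 12.0},
    {"doi": "pub-C", "coupling_partner": "Grignard reagent", "yield": 91.0},
]

# Hold out roughly one third of the publications.
train, test = split_by_group(reactions, "doi", test_fraction=0.34)
```

Passing `"coupling_partner"` instead of `"doi"` as `group_key` would give the analogous split for unseen coupling partners. A random row-level split would instead let reactions from the same paper appear on both sides, which tends to overstate how well a model generalizes to new literature.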


Machine Learning
Reaction Yield Prediction

Supplementary materials

Supplementary Information
Details on the code and the methods used to train the model and featurize the data. Additional information supporting the main manuscript.

