Abstract
With the increasing application of deep learning-based generative models to de novo molecule design,
quantitative estimation of molecular synthetic accessibility
has become crucial for prioritizing the structures that these
models generate. It is also useful for prioritizing hit/lead compounds and guiding retrosynthesis analysis. In the current study, based on the USPTO and Pistachio reaction datasets, we created a
chemical reaction network, in which a depth-first search was performed to identify
the reaction paths of product compounds. This reaction dataset was then used
to build predictive models that classify organic compounds as either
easy-to-synthesize (ES) or hard-to-synthesize (HS). Three synthetic
accessibility (SA) models were built using deep learning/machine learning
algorithms. Our three SA scoring functions were also compared with
existing synthetic accessibility scoring schemes, such as SYBA, SCScore, and
SAScore, and the graph-based deep learning model
outperformed these existing SA scores. Our results show that prediction models
based on historical reaction knowledge could be a useful tool for measuring
molecular complexity and estimating molecular SA.
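
As a rough illustration of the depth-first search over a reaction network mentioned above, the following minimal sketch enumerates precursor chains for a product compound. It is not the authors' implementation: the function names (`build_network`, `reaction_paths`), the representation of a reaction as a (reactants, product) pair, the placeholder compound labels, and the choice to treat any compound with no recorded producing reaction as a starting material are all assumptions made for illustration.

```python
from collections import defaultdict


def build_network(reactions):
    """Map each product to the reactant tuples of every reaction that forms it."""
    network = defaultdict(list)
    for reactants, product in reactions:
        network[product].append(tuple(reactants))
    return network


def reaction_paths(network, compound, _visited=None):
    """Depth-first enumeration of precursor chains starting at `compound`.

    Each returned path lists compounds from `compound` back to a terminal
    compound: one with no recorded producing reaction, or one whose only
    precursors are already on the current path (which guards against cycles).
    """
    visited = (_visited or frozenset()) | {compound}
    precursors = [
        reactant
        for reactants in network.get(compound, ())
        for reactant in reactants
        if reactant not in visited
    ]
    if not precursors:
        return [[compound]]
    paths = []
    for precursor in precursors:
        for tail in reaction_paths(network, precursor, visited):
            paths.append([compound] + tail)
    return paths


# Hypothetical toy data: letters stand in for compounds (e.g. SMILES strings).
reactions = [
    (("A", "B"), "C"),  # A + B -> C
    (("C", "D"), "E"),  # C + D -> E
    (("E",), "C"),      # E -> C, which creates a cycle between C and E
]
network = build_network(reactions)
paths = reaction_paths(network, "E")
longest_route = max(len(path) - 1 for path in paths)  # number of reaction steps
```

In a setup like this, a quantity such as the number of reaction steps along a path could serve as one signal for assigning ES/HS labels, although the labeling criterion actually used in the study is not stated in this abstract.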