Abstract
Identifying synthetic routes for molecules of interest is a crucial step in discovering new drugs or materials. Such routes can be found with computer-assisted synthesis planning, which relies on expansion policy networks trained on reaction templates extracted from patents and the literature. However, experience has shown that these networks are biased towards frequently reported reactions. This study shows that changing the molecular representation from an extended-connectivity fingerprint to a simple graph representation can increase the accuracy for templates used fewer than five times by 5.0-8.5 percentage points. We also illustrate that a simple oversampling of the training set yields a top-1 accuracy increase of 17-20 percentage points for templates used five times or fewer.
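
The oversampling idea can be illustrated with a minimal sketch. The abstract does not specify the exact scheme, so everything here is an assumption made for illustration: the training set is taken to be a list of (product SMILES, template ID) pairs, and examples of any template seen fewer than a hypothetical `min_count` times are duplicated at random until that count is reached. The function name and parameters are not from the study.

```python
import random
from collections import Counter, defaultdict


def oversample_rare_templates(samples, min_count=5, seed=0):
    """Duplicate examples of rare templates until each has min_count examples.

    samples: list of (product_smiles, template_id) pairs (assumed layout).
    min_count: hypothetical threshold, not a value taken from the study.
    """
    rng = random.Random(seed)

    # Group training examples by the template they are labelled with.
    by_template = defaultdict(list)
    for sample in samples:
        by_template[sample[1]].append(sample)

    augmented = list(samples)
    for group in by_template.values():
        shortfall = min_count - len(group)
        if shortfall > 0:
            # Sample with replacement from the existing examples of this template.
            augmented.extend(rng.choices(group, k=shortfall))

    # Shuffle so duplicated examples are not clustered at the end.
    rng.shuffle(augmented)
    return augmented


# Toy data: template "A" is frequent, "B" and "C" are rare.
toy = [("CCO", "A")] * 6 + [("CCN", "B")] * 2 + [("CCC", "C")]
balanced = oversample_rare_templates(toy)
counts = Counter(template_id for _, template_id in balanced)
```

Frequent templates are left untouched in this sketch, so only the rare classes grow. Oversampling of this kind should be applied to the training split only, so that duplicated examples do not leak into validation or test data.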