Abstract
The reaction dataset from the US Patent Office (USPTO), which is widely used to train retrosynthesis models for computer-assisted synthesis planning (CASP), is biased towards a few over-represented reaction types such as palladium couplings and protecting group operations. Here we applied 14,325 reaction templates extracted from USPTO reactions to 1,505,837 USPTO molecules and used a transformer-based approach derived from our recently reported triple transformer loop (TTL) retrosynthesis model to test and validate up to 5,000 reactions per template. This approach yielded 25.7 million fictive reactions, from which we selected up to 90 reactions per template to form an equilibrated augmented dataset of 1,000,245 reactions. Combining the original USPTO dataset with this augmented dataset by multitask transfer learning produced a new TTL model with higher overall and template-averaged single-step round-trip accuracy. Applying a new disconnection-aware forward validation transformer increased performance further.
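The per-template cap and the two accuracy figures mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `cap_per_template` and `accuracies`, the use of uniform random sampling to choose the retained reactions, and the reading of "template-averaged" as the unweighted mean of per-template accuracies are all assumptions made for illustration. With a cap of 90, 14,325 templates could contribute at most 14,325 × 90 = 1,289,250 reactions, so the reported 1,000,245 is consistent with "up to 90" meaning that some templates contribute fewer.

```python
import random
from collections import defaultdict


def cap_per_template(reactions, max_per_template=90, seed=0):
    """Keep at most `max_per_template` reactions for each template.

    `reactions` is an iterable of (template_id, reaction) pairs.
    Uniform random sampling is an assumption; the source only states
    that up to 90 reactions per template were selected.
    """
    by_template = defaultdict(list)
    for template_id, reaction in reactions:
        by_template[template_id].append(reaction)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for template_id in sorted(by_template):
        candidates = by_template[template_id]
        if len(candidates) > max_per_template:
            candidates = rng.sample(candidates, max_per_template)
        balanced.extend((template_id, r) for r in candidates)
    return balanced


def accuracies(results):
    """Return (overall, template_averaged) accuracy.

    `results` is a non-empty iterable of (template_id, is_correct) pairs.
    Overall accuracy weights every reaction equally; the template-averaged
    value (assumed definition) weights every template equally.
    """
    per_template = defaultdict(list)
    for template_id, is_correct in results:
        per_template[template_id].append(bool(is_correct))

    n_total = sum(len(v) for v in per_template.values())
    n_correct = sum(sum(v) for v in per_template.values())
    overall = n_correct / n_total
    template_averaged = sum(
        sum(v) / len(v) for v in per_template.values()
    ) / len(per_template)
    return overall, template_averaged
```

On a skewed toy set with one frequent template that is mostly predicted correctly and one rare template that is always missed, the overall accuracy stays high while the template-averaged accuracy drops sharply, which is why the two are worth reporting separately for a dataset biased towards a few reaction types.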