Abstract
Combining quantum chemistry characterizations with generative machine learning models has the potential to accelerate molecular searches in chemical space. In this paradigm, quantum chemistry acts as a relatively cost-effective oracle for evaluating the properties of particular molecules, while generative models provide a means of sampling chemical space based on learned structure-function relationships. For practical applications, multiple potentially orthogonal properties must be optimized in tandem during a discovery workflow. This carries additional difficulties associated with the specificity of the targets and the ability of the model to reconcile all properties simultaneously. Here we demonstrate an active learning approach to improve the performance of multi-target generative chemical models. We first demonstrate the effectiveness of a set of baseline models trained on single property prediction tasks in generating novel compounds with various property targets, including both interpolative and extrapolative generation scenarios. For property ranges where accurate targeting proves difficult, the novel compounds suggested by the model are characterized using quantum chemistry to obtain their true property values, and the new molecules that come closest to the desired properties are fed back into the generative model for additional training. This gradually improves the generative models’ understanding of unknown areas of chemical space and shifts the distribution of generated compounds towards the targeted values. We then demonstrate the effectiveness of this active learning approach in generating compounds with multiple chemical constraints, including vertical ionization potential, electron affinity, and dipole moment targets, and validate the results at the ωB97X-D3/def2-TZVP level.
This method requires no modifications to extant generative approaches, but rather utilizes their inherent generative and predictive aspects for self-refinement, and can be applied to situations where any number of properties with varying degrees of correlation must be optimized simultaneously.
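The generate, characterize, select, and retrain cycle described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the method used in this work: molecules are reduced to a single number, `oracle` is a toy linear function in place of a quantum chemistry calculation, `ToyGenerator` is a Gaussian fitted to the training pool in place of a generative chemical model, and only one property target (`TARGET`) is used instead of several.

```python
import random
import statistics

TARGET = 9.0  # hypothetical property target (assumption for illustration)


def oracle(x):
    """Toy stand-in for a quantum chemistry calculation (assumed linear property)."""
    return 2.0 * x + 1.0


class ToyGenerator:
    """Toy stand-in for a generative model: a Gaussian fitted to the training pool."""

    def fit(self, pool):
        self.mean = statistics.fmean(pool)
        self.std = statistics.stdev(pool)

    def sample(self, n, rng):
        return [rng.gauss(self.mean, self.std) for _ in range(n)]


def active_learning(pool, rounds=5, n_samples=200, n_keep=10, seed=0):
    """Run the loop and return per-round mean target errors and the final pool."""
    rng = random.Random(seed)
    model = ToyGenerator()
    pool = list(pool)
    errors = []
    for _ in range(rounds):
        # (Re)train the generator on the current pool.
        model.fit(pool)
        # Generate candidates and label each one with the oracle.
        candidates = model.sample(n_samples, rng)
        labelled = [(abs(oracle(x) - TARGET), x) for x in candidates]
        # Record how far this round's generated candidates are from the target.
        errors.append(statistics.fmean([err for err, _ in labelled]))
        # Feed the candidates closest to the target back into the pool.
        labelled.sort()
        pool.extend(x for _, x in labelled[:n_keep])
    return errors, pool


if __name__ == "__main__":
    errors, pool = active_learning([-1.0, -0.5, 0.0, 0.5, 1.0])
    for i, err in enumerate(errors):
        print(f"round {i}: mean |property - target| = {err:.2f}")
```

In this toy setting the initial pool lies far from the target, so the first round is an extrapolative scenario. Each round adds the best-scoring candidates to the pool, which shifts the fitted distribution towards the target. A multi-target variant would replace the scalar error with a combined distance over several oracle-computed properties.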