Abstract
Generative models for the inverse design of molecules with particular properties have been heavily hyped but have yet to demonstrate significant gains over machine-learning-augmented expert intuition. A major challenge for such models is their limited accuracy in predicting molecules with targeted properties in the data-scarce regime, which is the regime typical of the prized outliers that inverse models are hoped to discover. For example, activity data for a drug target or stability data for a material may only number in the tens to hundreds of samples, which is insufficient to learn an accurate and reasonably general property-to-structure inverse mapping from scratch. We have hypothesized that the property-to-structure mapping becomes unique when a sufficient number of properties are supplied to the models during training. If true, this hypothesis has several important corollaries. It would imply that data-scarce properties can be completely determined by a set of more accessible molecular properties. It would also imply that a generative model trained on multiple properties would exhibit an accuracy phase transition after reaching a sufficient size, a process analogous to what has been observed in the context of large language models. To interrogate these behaviors, we have built the first transformers trained on the property-to-molecular-graph task, which we dub “large property models” (LPMs). A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data. The motivation for the large property model paradigm, the model architectures, and case studies are presented here for review and discussion at the upcoming Faraday Discussion on “Data-driven discovery in the chemical sciences”.