Abstract
Accurate and efficient modeling of chemical reactions is essential for advances in catalysis, synthesis, and materials design. Machine learning potentials (MLPs) offer a computationally efficient alternative to \textit{ab initio} methods; however, developing broadly applicable reactive MLPs remains challenging because of the inherent complexity of reactive chemistry. Here, we present a scalable workflow for developing reactive MLPs tailored to systems containing C, H, O, and N. We first construct a large-scale pre-training dataset of over 17 million non-equilibrium structures along chemical reaction pathways, generated by combining the Nudged Elastic Band (NEB) method with structure alignment algorithms, with energies and forces labelled at the semi-empirical level. We then use active learning to efficiently build a high-precision fine-tuning dataset of over 200,000 structures labelled at the Density Functional Theory (DFT) level. A range of model architectures and training paradigms, including pretraining-finetuning and transfer learning frameworks, is systematically benchmarked. The resulting optimized MLP achieves state-of-the-art predictive accuracy and generalization for reactive chemical environments involving C, H, O, and N. Notably, when integrated with the machine-learned DFT model developed in our prior work, the combined model reaches accuracy close to that of coupled cluster calculations and outperforms many conventional DFT methods across diverse reactive systems. This work establishes a robust framework for constructing highly accurate and transferable reactive MLPs, paving the way for large-scale, high-fidelity simulations of complex chemical processes across many scientific and engineering disciplines.