Abstract
Transition state (TS) search is crucial for illuminating chemical reaction mechanisms but remains the major bottleneck in automated reaction discovery because of its high computational cost. Recently, machine learning interatomic potentials (MLIPs) and generative models have shown promise in accelerating TS search, but their comparative strengths and limitations remain unclear. In this study, we establish the first systematic and rigorous benchmarking framework to evaluate the effectiveness of ML methods in TS search, enabling a standardized and application-relevant assessment of their performance. Using an end-to-end TS search workflow, we benchmark seven representative MLIPs alongside React-OT, a state-of-the-art generative model. Our results demonstrate that pre-trained foundation MLIPs frequently fail to locate TSs reliably without task-specific fine-tuning. Furthermore, traditional energy and force metrics alone do not reliably predict TS search success, underscoring the need for more tailored evaluation criteria. Notably, React-OT frequently outperforms its MLIP counterpart, highlighting the potential of generative approaches for TS discovery. This benchmark serves as a foundation for the development and evaluation of future ML methods for chemical reactions, offering guidance for improving their generalizability and reliability in reactive chemistry.