Machine learning and computer-aided approaches significantly accelerate molecular design and discovery in scientific and industrial fields increasingly relying on data science for efficiency. The typical method used is supervised learning which needs huge datasets. Semi-supervised machine learning approaches are effective to train unlabeled data with improved modeling performance, whereas they are limited by the accumulation of prediction errors. Here, to screen solvents for removal of methyl mercaptan, a type of organosulfur impurities in natural gas, we constructed a computational framework by integrating molecular similarity search and active learning methods, namely, molecular active selection machine learning (MASML). This new model framework identifies the optimal molecules set by molecular similarity search and iterative addition to the training dataset. Among all 126,068 compounds in the initial dataset, 3 molecules were identified to be promising for methyl mercaptan (MeSH) capture, including benzylamine (BZA), p-methoxybenzylamine (PZM), and N,N-diethyltrimethylenediamine (DEAPA). Further experiments confirmed the effectiveness of our modeling framework in efficient molecular design and identification for capturing methyl mercaptan, in which DEAPA presents a Henry's law constant 89.4% lower than that of methyl diethanolamine (MDEA).
Supporting Information - Intelligent Molecular Identification for High Performance Organosulfide Capture Using Active Machine Learning Algorithm