Abstract
Traditional Chinese Medicine (TCM) has long been regarded as a valuable resource for modern drug discovery. However, the limited availability of recorded entities and information, the complexity and sparsity of the herb–ingredient–target–disease network, and inconsistencies in data representation hinder the effectiveness of high-throughput screening approaches. Although some therapeutically valuable compounds from TCM have been discovered through manual experimental screening, such methods are time-consuming and labor-intensive. To address these challenges, we developed TCM-Navigator, a data-driven, deep learning–based workflow that enables the in silico generation, quality control, and physics-based evaluation of TCM-like molecules. Generation is performed by TCM-Generator, a chemical language model based on transfer learning and LSTM that produces standardized, hierarchically structured, and high-throughput–friendly datasets of TCM-like molecules. In this study, we generated a target-nonspecific dataset comprising 3.7 million TCM-like molecules, expanding the number of entities in existing TCM datasets by more than 100-fold. The workflow also enables flexible, goal-driven molecule generation customized for specific targets, yielding three target-specific datasets and multiple high-potential target-ligand pairs. Quality control is performed by TCM-Identifier, the first quantitative model specifically designed to capture unique characteristics of TCM, built on an AttentiveFP framework with Message Passing Neural Networks (MPNNs). TCM-Identifier is expected to serve as an essential evaluation and guidance tool for TCM-related drug development. Our workflow bridges cutting-edge data science, including deep learning, with biomedical research to tackle longstanding challenges in target identification and molecular design. Its adaptable framework is also transferable to interdisciplinary innovation beyond drug development.
Supplementary materials
Title
Supplementary figures
Description
Supplementary Figures 1 to 5, referenced in the main manuscript.
Title
Supplementary Methods
Description
Supplementary Methods mentioned in the main text, including: Datasets, Compound Generation with TCM-Generator, Evaluation of Chemical Space, ADMET and Chemical Properties Analysis, TCM-Identifier, Molecular Docking, and Molecular Dynamics (MD) Simulation.