Collective Intelligence of Specialized Language Models Guides Realization of de novo Chemical Synthesis

Haote Li; Sumon Sarkar; Wenxin Lu; Patrick Loftus; Tianyin Qiu; Yu Shee; Abbigayle Cuomo; John-Paul Webster; H. Ray Kelly; Vidhyadhar Manee; Sanil Sreekumar; Frederic Buono; Robert Crabtree; Timothy Newhouse; Victor Batista

doi:10.26434/chemrxiv-2025-dc28b

Organic Chemistry

31 January 2025, Version 1

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

While hundreds of thousands of new chemical reactions are reported annually, efficient use of this vast collection of synthetic knowledge remains a persistent challenge in modern chemistry. Recent applications of large language models (LLMs) have shown promise, but systems that reliably work for de novo compounds and molecular transformations have remained elusive. Here we introduce MOSAIC (Multiple Optimized Specialists for AI-Driven Chemical Prediction), a computational framework that enables chemists to harness the collective knowledge of millions of reaction protocols. In contrast to existing approaches relying on agentic models, MOSAIC leverages the open-source Llama3.1-8B-instruct architecture. By training 2,489 specialized chemical experts on Voronoi-clustered reaction spaces, we establish a scalable paradigm that delivers reproducible and human-readable experimental protocols for complex syntheses. Experimental validation demonstrates MOSAIC's ability to predict and execute previously unreported transformations, including challenging reactions via Buchwald-Hartwig amination, Suzuki coupling, and olefin metathesis. We validate this approach through the successful synthesis of over 35 novel compounds spanning pharmaceuticals, materials, agrochemicals, and cosmetics. This framework establishes a new relationship between computational and experimental chemistry, providing a foundation for accelerated chemical discovery across disciplines.
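
The idea of assigning reactions to specialists through a Voronoi partition can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional "embeddings", the three centroids, and the function names `nearest_centroid` and `partition` are all hypothetical, chosen only to show how nearest-centroid assignment divides a space into cells and routes a new query to the specialist responsible for its cell.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance).

    The set of points that share the same nearest centroid is one Voronoi cell.
    """
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def partition(points, centroids):
    """Group point indices by the Voronoi cell (nearest centroid) they fall in."""
    cells = {i: [] for i in range(len(centroids))}
    for idx, p in enumerate(points):
        cells[nearest_centroid(p, centroids)].append(idx)
    return cells

# Hypothetical toy data: 2-D stand-ins for reaction embeddings.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
reactions = [(1.0, 1.0), (9.0, 1.0), (0.5, 8.0), (2.0, -1.0)]

# Each cell would supply the training set for one specialist model.
cells = partition(reactions, centroids)

# At query time, a new reaction is routed to the specialist of its cell.
query_expert = nearest_centroid((8.0, 2.0), centroids)
```

In this toy setting each cell's members would form the training data for one specialist, and a new query is answered by the specialist whose centroid lies nearest. How the paper actually embeds reactions and selects centroids is not stated in the abstract.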

Keywords

Large Language Models

Collective Chemical Intelligence

Reaction Development

Organic Synthesis and Reaction

Supplementary materials

Supplementary Information

The Supplementary Information contains detailed computational and experimental data such as training logs, spectra and procedures.

Version History

Jan 31, 2025 Version 1

License

The content is available under CC BY NC ND 4.0

Funding

Boehringer Ingelheim

National Science Foundation (grant 2302908)

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
