Abstract
Recently, the growth of commercially available molecules has been driven by “tangible”, make-on-demand virtual libraries. Such billion-molecule libraries can never be fully synthesized, tested, or even stored. The only way to explore this expanded chemical space is by computationally prioritizing particular molecules for synthesis and testing, often by docking. The success of this prioritization may depend on library properties: how diverse the molecules are, how similar they are to bio-like molecules such as metabolites and drugs, how receptor fit improves with library size, and how the presence of artifacts grows with library size. To begin to investigate these questions, we compare the characteristics and performance of a library of 3 million “in-stock” molecules with those of ever-larger tangible libraries, up to 3 billion molecules in size. The bias toward biologically precedented molecules is 19,000-fold lower in the 886-fold larger tangible library than in the in-stock library. Turning from the overall libraries to docking hits, thousands of high-ranking tangible compounds that were synthesized and tested in five ultra-large library docking campaigns are also dissimilar to bio-like molecules. These observations imply that bio-likeness plays little role in the likelihood of binding, which appears to contradict multiple earlier studies. Another important aspect of library growth is whether screening ever-larger libraries leads to better ligands. Judged by docking score, better-fitting molecules are found as the library grows, with score improving log-linearly with library size. Finally, as library size increases, so too may the chances of rare events: molecules that cheat the scoring function and rank artifactually well. Both simulation and experimental results from ultra-large library screens suggest that this is true: as the libraries grow, more and more artifacts can crowd the very top-ranking molecules. Although the nature of these artifacts appears to change from target to target, the expectation of their occurrence does not, and simple strategies may be devised to minimize the impact of these rare-event artifacts on the success of large library screens.
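The two size effects described above, log-linear improvement of the best score and the crowding of top ranks by rare artifacts, can be illustrated with a toy simulation. This is a minimal sketch and not the simulation used in this work: the exponential score tails, the 1% artifact rate, the heavier artifact tail, the toy library sizes, and the names `simulate_screen` and `slope_per_decade` are all assumptions chosen for illustration. Exponential tails are used because the expected best of N such draws improves roughly linearly in log N.

```python
import heapq
import math
import random


def simulate_screen(sizes, artifact_rate=0.01, top_k=100, seed=0):
    """Toy model of docking nested libraries of growing size.

    Returns one (size, best_score, artifacts_in_top_k) tuple per library size.
    """
    rng = random.Random(seed)
    library = []  # (score, is_artifact); a more negative score is a better fit
    results = []
    for size in sorted(sizes):
        # Grow the library so each smaller library is a subset of the next.
        while len(library) < size:
            is_artifact = rng.random() < artifact_rate
            # Assumption: artifacts follow a heavier-tailed score distribution.
            scale = 2.0 if is_artifact else 1.0
            library.append((-scale * rng.expovariate(1.0), is_artifact))
        top = heapq.nsmallest(top_k, library)  # best-scoring molecules first
        best_score = top[0][0]
        n_artifacts = sum(1 for _, flag in top if flag)
        results.append((size, best_score, n_artifacts))
    return results


def slope_per_decade(results):
    """Least-squares slope of best score against log10(library size)."""
    xs = [math.log10(size) for size, _, _ in results]
    ys = [best for _, best, _ in results]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


results = simulate_screen([10**3, 10**4, 10**5])
for size, best, n_art in results:
    print(f"N={size:>7,d}  best score={best:7.2f}  artifacts in top 100={n_art}")
print(f"slope per 10-fold growth: {slope_per_decade(results):.2f}")
```

In this sketch the artifact fraction of the whole library stays fixed, yet the number of artifacts among the top 100 is expected to rise as the library grows, because the heavier tail dominates at extreme scores. Real docking score distributions and artifact mechanisms are target-dependent, so the numbers are illustrative only.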