Sanitize It Yourself: Web-based molecular sanitization for machine-generated chemical structures

Correspondence: kazuki@ric.u-tokyo.ac.jp, Isotope Science Center, The University of Tokyo, Tokyo, Japan. Full list of author information is available at the end of the article.

Abstract
Many computer-aided drug design (CADD) methods using deep learning have recently been proposed to explore the chemical space toward novel scaffolds efficiently. However, there is a tradeoff between the ease of generating novel structures and the chemical feasibility of structural formulas. To overcome the limitations of computational filtering, we have implemented a web-based software in which users can share and evaluate computer-generated compounds. The web service is available at https://sanitizer.chemical.space/.


Introduction
Computer-aided drug design (CADD) has become an even more active research field with the rise of deep learning [1]. Designing feasible new compounds requires cooperation among researchers from backgrounds ranging from organic chemistry to computer science; however, combining multidisciplinary insights is not always easy. In recent years, there has been a great deal of research on molecular generative models. Still, some of these studies report only numerical performance evaluations and do not discuss the generated compounds from a chemical perspective. Extracting candidates of true value for drug development from computationally generated molecules requires a multifaceted evaluation [2,3].
In fact, some medicinal chemists have criticized the usefulness of molecular generative models, pointing out that even algorithms claiming high performance scores often generate chemically infeasible molecules, which they refer to as "crazy structures" [4]. To identify such inappropriate structures, it is necessary to define and calculate the appropriateness of generated molecules. Some researchers attempt to quantify appropriateness in terms of synthetic feasibility by running automatic retrosynthesis tools and counting the number of synthetic steps required to produce the molecule [5,6]. These approaches can screen millions of generated compounds, but their reliability is not yet established, since automatic retrosynthesis tools sometimes give incorrect synthetic routes even for simple molecules [7]. Given this situation, human-based molecular sanitization is still necessary to ensure the reliability of computer-generated molecules.
Decision-making in drug development often relies on the tacit knowledge of experienced medicinal chemists in the field [8], which makes it useful to gather a wide range of their views [9]. In this context, we have been proposing the use of collective knowledge through the social web as one of the trends in open science for drug discovery, which we call social drug discovery. In tasks such as prioritizing compounds [10] or selecting feasible 3D interaction structures in the structure-based drug design process, we can harness the power of the crowd through means such as voting [11]. The majority opinion of scientists can be a valuable source of information for drug discovery. Likewise, we can apply the wisdom of crowds [12] to scrutinize computer-generated molecules.
In this paper, we introduce a web-based molecular sanitization tool for computer-generated molecules. Users can share a list of generated molecules in the service and ask a wide range of people to evaluate their generative algorithm. This web-based voting function democratizes the drug discovery research process and may facilitate social drug discovery. In the following sections, we first describe the problems with molecules generated by popular algorithms from the perspective of medicinal chemists, then introduce the new visualization tool, and finally describe future perspectives.

Problematic structures generated by molecular generation algorithms
Early research on generative models for molecules focused mainly on increasing the ratio of valid molecules among all generated ones. RDKit [13] has been used to assess validity [14], and validity is now regarded as one of the most important benchmarks for evaluating generative models [15]. After numerous efforts to increase the validity ratio, the generative models in the most recent reports have achieved very high validity. However, it has been pointed out that benchmark metrics such as validity cannot properly evaluate the generated molecules. Renz and coworkers described failure modes of generative models and showed that the generated molecules can contain unstable, synthetically infeasible, or highly uncommon substructures [16]. Examples of such unwanted molecules are shown in Figure 1. Various quality filters have been proposed to discard unwanted molecules. For example, Pan Assay Interference Compounds (PAINS) [17] and medicinal chemistry filters (MCF) are implemented in the MOSES package [18]. Although these filters are useful to some extent, some unwanted molecules remain unfiltered, because which substructures are unwanted depends on each user's individual situation. Users must therefore prepare their own custom filters to obtain meaningful generated molecules. In fact, REINVENT [19], one of the most cited and widely used generative models, provides a Custom Alerts (CA) component that enables users to define their own unwanted substructures. Preparing custom filters is a laborious task, and tools that support visual inspection are essential for finding unwanted molecules and substructures among generated molecules.

Web-based molecular sanitization
We developed a web-based molecule visualization tool to enhance molecular sanitization through visual inspection. Organized visualization of the chemical space of molecules is necessary to check the computer-generated molecules. In addition, we encourage users to share their molecule lists with the world and vote for promising molecules to gain the wisdom of crowds [20] or collective human intelligence [21]. The web service is available at https://sanitizer.chemical.space/. Although the main goal of this project is to share molecules for collective knowledge, users can build their own server using the source code for the service if they want to deal with private data.

Implementation
Our website provides a graphical voting system for posted molecules. To implement the voting system, we need to store information of (1) molecules posted by users,
User management and login
We ask users to log in to the service to identify evaluators. We currently support OAuth authorization with Twitter or ORCID using Python Social Auth [25]. If the user is not logged in, they are treated as a guest user.
Project
Users can create a new project or view existing projects on the dashboard page (Figure 2a). A project is the unit that manages a list of molecules. Users create a new project by uploading up to 10,000 molecules as SMILES (Figure 2b). The uploaded molecules are processed by RDKit [13] on the server side to add tags. We currently support three tags: Rule of five [26], PAINS [17], and MCF [18]. The rule of five (Ro5) tag is added when the molecule satisfies all of the following criteria: (i) the molecule has no more than 5 hydrogen bond donors, (ii) the molecule has no more than 10 hydrogen bond acceptors, (iii) the molecular mass is less than 500 daltons, and (iv) the calculated log P is less than 5. The PAINS filter is implemented using RDKit's FilterCatalog, and MCF is implemented using the SMARTS list from the MOSES benchmark [18].
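The four Ro5 criteria above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the service's actual code: the `Descriptors` container and the `passes_ro5` function are hypothetical names, and the descriptor values are assumed to have been computed beforehand (for example with RDKit on the server side, as described above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular properties (hypothetical container)."""
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_mass: float  # in daltons
    log_p: float           # calculated log P


def passes_ro5(d: Descriptors) -> bool:
    """Return True only if all four criteria listed in the text hold."""
    return (
        d.h_bond_donors <= 5          # (i) no more than 5 donors
        and d.h_bond_acceptors <= 10  # (ii) no more than 10 acceptors
        and d.molecular_mass < 500    # (iii) mass below 500 daltons
        and d.log_p < 5               # (iv) calculated log P below 5
    )


# Illustrative values only, roughly those of a small drug-like molecule.
small = Descriptors(h_bond_donors=1, h_bond_acceptors=4,
                    molecular_mass=180.16, log_p=1.3)
# A made-up heavy, lipophilic molecule that violates criteria (iii) and (iv).
heavy = Descriptors(h_bond_donors=2, h_bond_acceptors=8,
                    molecular_mass=650.0, log_p=6.2)

ro5_tags = {"small": passes_ro5(small), "heavy": passes_ro5(heavy)}
```

Keeping the check separate from descriptor calculation makes the thresholds easy to inspect and adjust if a project needs a stricter or looser variant.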
Molecule Viewer
On the project page, users can see the list of uploaded molecules. Each molecule is rendered by RDKit for JavaScript on the client side. Users can evaluate molecules by pressing the like and dislike buttons. They can filter molecules by tag (Ro5, PAINS, and MCF) or by current users' evaluations (like or dislike). If the link to the project page is shared, multiple users can evaluate the molecules. Users can also export the SMILES of filtered or evaluated molecules.
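The filtering and export behavior described above could be modeled as follows. This is only a sketch of one possible data model: the `Entry` class, the `filter_entries` and `export_smiles` functions, and the user names are all hypothetical, and the real service may store and query votes differently.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One uploaded molecule with its tags and votes (hypothetical model)."""
    smiles: str
    tags: set = field(default_factory=set)      # e.g. {"Ro5", "PAINS", "MCF"}
    likes: set = field(default_factory=set)     # names of users who liked it
    dislikes: set = field(default_factory=set)  # names of users who disliked it


def filter_entries(entries, required_tag=None, liked_by=None):
    """Keep entries that carry `required_tag` and were liked by `liked_by`.

    A criterion set to None is ignored.
    """
    kept = []
    for entry in entries:
        if required_tag is not None and required_tag not in entry.tags:
            continue
        if liked_by is not None and liked_by not in entry.likes:
            continue
        kept.append(entry)
    return kept


def export_smiles(entries):
    """Return one SMILES string per line, ready to be saved as a text file."""
    return "\n".join(entry.smiles for entry in entries)


# Made-up project content for illustration.
project = [
    Entry("CCO", tags={"Ro5"}, likes={"alice"}),
    Entry("c1ccccc1", tags={"Ro5"}, dislikes={"alice"}),
    Entry("CC(=O)O", tags={"Ro5", "MCF"}, likes={"alice", "bob"}),
]

liked_by_alice = export_smiles(filter_entries(project, liked_by="alice"))
```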
Substructure Search
Users can search for molecules that contain specific substructures using SMARTS (SMILES arbitrary target specification) [27]. They can edit the SMARTS pattern with JSME [28] or type it into the text box (Figure 3a). The molecule editor can also be invoked from the menu button of each molecule to search for similar molecules. The matched substructure is highlighted on the viewer page.
Nearest Neighbor Search
Users can also search a project for molecules similar to a specific molecule. Nearest neighbors are determined based on the angular distance between MACCS (Molecular ACCess System) keys [29]. Annoy [30], an approximate nearest neighbor search library, is used for fast neighbor search.
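To make the distance concrete, the sketch below computes an angular distance between binary fingerprints and ranks neighbors by exact brute-force search. Several points are assumptions: the formula sqrt(2(1 - cos)) is the one Annoy documents for its "angular" metric, the function names are hypothetical, the fingerprints are short toy bit vectors standing in for MACCS keys, and the exhaustive search replaces Annoy's approximate index purely for illustration.

```python
import math


def angular_distance(a, b):
    """Angular distance between two bit vectors: sqrt(2 * (1 - cosine))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero fingerprint as orthogonal to everything.
        return math.sqrt(2.0)
    cosine = dot / (norm_a * norm_b)
    # Clamp at zero to absorb floating-point rounding when cosine is ~1.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def nearest_neighbors(query, fingerprints, k=3):
    """Return the indices of the k fingerprints closest to `query`."""
    order = sorted(
        range(len(fingerprints)),
        key=lambda i: angular_distance(query, fingerprints[i]),
    )
    return order[:k]


# Toy 4-bit fingerprints (real MACCS keys are much longer).
fingerprints = [
    [1, 1, 0, 0],  # identical to the query
    [1, 0, 0, 0],  # shares one bit with the query
    [0, 0, 1, 1],  # shares no bits with the query
]
query = [1, 1, 0, 0]

closest_two = nearest_neighbors(query, fingerprints, k=2)
```

An exact scan like this costs time proportional to the project size for every query, which is the cost an approximate index such as Annoy is meant to reduce.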
Molecule Info
Each molecule has a separate page showing its detailed information. The page lists the molecular properties used to determine whether the molecule satisfies the rule of five, the names of the users who evaluated the molecule, and the substructure flagged by the PAINS filter (if applicable). The page has two buttons for Twitter integration: one to share the molecule on Twitter, and one to send it to the retrosynthesis bot [31] (Figure 3c).

Case study
Medicinal chemists evaluated our web application. The goal of this case study was to use the visualization tool to find invalid molecules among those generated by the method from one of the authors' previous work [32]. According to the users, the visualization and search functionalities of the app made it easy to find unwanted molecules in the list of generated molecules. An example of an unwanted molecule is shown in Figure 4.

Conclusion
In this study, we pointed out a problem with current molecular generative models: generated molecules are very likely to include some unwanted compounds. Benchmark metrics are not sufficient to prioritize and select appropriate compounds from the generated ones. It would be very helpful if molecules with chemically unstable or synthetically infeasible substructures could be captured effectively and automatically. Although there are attempts to filter out unwanted structures, visual inspection by experts is still necessary. We implemented a web-based tool that eases molecular sanitization based on visual inspection by experts. We will continue to develop the application, reflecting users' opinions.

Availability and requirements
• Availability of data and materials The website is available at https://sanitizer.chemical.space/.

Competing interests
The authors declare that they have no competing interests.

Funding
The server maintenance cost has been and will be paid by SHaLX Inc.

Authors' contributions
NY implemented the software. NY, KR, and KZY wrote the paper. KZY maintains the cloud web servers.