ChemPlot, a Python library for chemical space visualization

Visualizing chemical spaces streamlines the analysis of molecular datasets by reducing the information to human perception level, hence it forms an integral piece of molecular engineering, including chemical library design, high-throughput screening, diversity analysis, and outlier detection. We present here ChemPlot, which enables users to visualize the chemical space of molecular datasets in both static and interactive ways. ChemPlot features structural and tailored similarity methods, together with three different dimensionality reduction methods: PCA, t-SNE, and UMAP. ChemPlot is the first visualization software that tackles the activity/property cliff problem by incorporating tailored similarity. With tailored similarity, the chemical space is constructed in a supervised manner considering target properties. Additionally, we propose a metric, the Distance Property Relationship score, to quantify the property difference of similar (i.e. close) molecules in the visualized chemical space. ChemPlot can be installed via Conda or PyPI (pip) and a web application is freely accessible at https://www.amdlab.nl/chemplot/.

Visualization of chemical spaces usually consists of three main steps. In the first step, molecular representations (e.g. SMILES or InChI) are converted into binary or real-value arrays. Each element of the array corresponds to a dimension in the chemical space. In the second step, the high-dimensional chemical space is reduced to 2D or 3D spaces. In the final step, the reduced chemical space is converted into a visual representation. Fingerprints and descriptors [1][2][3] are the most widely used molecular encoding methods for the first step. Principal component analysis (PCA) 4 , self-organized maps (SOM) 5 , multidimensional scaling (MDS) 6 , and t-distributed stochastic neighbor embedding (t-SNE) 7 are the commonly used dimensionality reduction methods. Finally, visualizations can take the shape of either 2D or 3D, static or interactive, scatter or grid-like plots. In addition to these, there are alternative chemical space visualization approaches including the minimum spanning-trees 8 , similarity networks 9 , and uniform manifold approximation and projection (UMAP) 10 .
In practice, screening of molecules with desired properties is usually based on the similarity property principle (SPP), which assumes that similar molecules (close to each other in chemical space) show similar properties. However, it is not uncommon for pairs of similar molecules to violate the SPP 11 and such violations are referred to as activity/property cliffs (APCs).
APCs may mislead the screening and result in a deceleration of the discovery process of molecules with target properties. It is important to spot or, even better, prevent APCs to facilitate a speedy discovery process. Several visualization tools include additional analysis modules for APCs [12][13][14] . However, currently, there is no publicly available chemical space visualization tool that provides a remedy to the APC problem as part of the process.
Even though the majority of chemical space visualization tools are distributed as an integrated feature of commercial software packages, several dedicated tools are also freely available 15 . In support of open science, the latter tools have been extensively serving the chemical informatics community, although they carry non-negligible sustainability concerns. For instance, free visualization tools are available as standalone software 12,13,16,17 and web server applications 14,18,19 .
However, they all require developer-dependent maintenance for version updates, bug-fixing, and adding new features. Therefore, with the completion of code development projects, the once-effective tools may become inaccessible, outmoded, or face compliance issues with the continually evolving operating systems.
Here, we present ChemPlot, a Python library for chemical space visualization. ChemPlot is free and its source code is available on GitHub under the Berkeley Software Distribution (BSD) license. ChemPlot is user-friendly software with which users can transform their datasets into interactive 2D chemical space visualizations. The current version (1.2.0) of ChemPlot provides three dimensionality reduction methods: PCA, t-SNE, and UMAP. Additionally, two similarity methods are supported: structural and tailored. To our knowledge, ChemPlot is the first tool that applies tailored similarity 20 , which measures the similarity of molecules in a supervised manner considering the target properties of compounds, and thus provides a way to deal with the APC problem. Additionally, to better quantify the target property difference between molecules that are close to each other in a chemical space, we devise a new metric called the Distance Property Relationship (DPR) score. ChemPlot can be installed via well-known Python package managers, including Conda and PyPI (pip). Finally, alongside the user manual that contains practical information and examples, we also provide a web interface to encourage the use of ChemPlot without the need for coding expertise (see Code Availability).

RESULTS
In this section, we present the features of ChemPlot with the help of example code and figures. We start with the installation of ChemPlot, next exemplify various data visualizations, and then present the advanced functionalities of the code. Additionally, we compare the visualization performance of the currently available dimension reduction and similarity methods in ChemPlot.
Lastly, we illustrate a workable use of the ChemPlot web application.

Installation
ChemPlot is an open-source Python library with its source code stored on a GitHub repository (see Code Availability). In addition to a direct installation from the repository, users can also install the deployed packages via Conda and PyPI package management environments. ChemPlot runs on the Python framework and it can be installed from two alternative sources via the command prompt using either of the following commands.

Visualization
To visualize the chemical space of a dataset, users first need to construct a Plotter object. The Plotter object can be created from a molecular dataset that contains SMILES or InChI notations of compounds. The below code shows a basic example of t-SNE plotting from the SMILES data.

Target Coloring
ChemPlot allows coloring the molecules based on a given target property. The given property is automatically classified as a numerical or categorical variable and the coloring is applied accordingly. Alternatively, users can also specify the target type themselves. As an example, the following code shows how the target assignment can be applied while creating the Plotter object.
cp = Plotter.from_smiles(smiles_list, target=target_list, target_type="C")

Dimensionality Reduction
In ChemPlot, three different dimensionality reduction methods (PCA, t-SNE, and UMAP) can be applied to map the molecules onto 2D. Figure 2 shows the representative visualizations of the Blood-brain barrier penetration (BBBP) 23 and AqSolDB 24 datasets that have been generated by applying these three methods. While PCA provides a linear projection of given dimensions, both t-SNE and UMAP apply non-linear 2D mappings by clustering and locating molecules depending on their local neighborhoods. For the t-SNE and UMAP parameters, users can choose between the two available options of: i) resorting to the optimized values that are calculated and assigned automatically by ChemPlot, or ii) assigning the values themselves.
Additional technicalities of the dimensionality reduction algorithms are provided in the Methods section. The following code shows the execution of the three different dimensionality reduction methods that are currently available in ChemPlot.

Molecular Similarity
ChemPlot implements two types of molecular similarity methods: structural and tailored. Structural similarity uses the generated molecule substructures and ignores the target property when evaluating the similarity. Tailored similarity, on the other hand, uses only the descriptors that correlate with the target property. Therefore, while structural similarity produces more generic multi-purpose chemical spaces, tailored similarity produces property-sensitive focused chemical spaces. Figure 2 and Figure 3 show visualizations of the same datasets by employing structural and tailored similarity methods, respectively. In ChemPlot, the similarity type can be assigned while creating the Plotter object. The default value for the similarity type is "tailored" when a target list is provided; otherwise, it is "structural". The following code shows the similarity type assignment.
cp = Plotter.from_smiles(smiles_list, target=target_list, sim_type="structural")
Figure 3 (caption, beginning not recovered): [...] Figure 2, but by using tailored similarity. Compared to the structural similarity, for the plots reduced by PCA, the explained variances are higher but at the same time the usage of the space is less efficient. For the plots reduced by t-SNE and UMAP, the clusters are separated from each other more clearly.

Clustering
ChemPlot features clustering of molecules in the reduced chemical space. The total number of clusters is defined by the n_cluster parameter. In the visuals, a legend is included which shows the assigned color, cluster number, and share of coverage over the whole dataset for each cluster. Alternatively, users can interactively select and recolor clusters of their choice to visually distinguish them. This option is particularly convenient for evaluating the extrapolation performance of ML models, such as by setting aside a test set that is positioned away from the training set. In addition to this, highlighting specific clusters or singletons could also be useful when exploring and learning about the chemical space of the dataset. Figure 4 shows the clustering of the Lipophilicity dataset, which has been dimensionally reduced by using the PCA method.
Figure 4. The clustering of sub-spaces from the Lipophilicity dataset. In (a), the chemical space is distributed over five clusters. The colors, identifiers, and allotments of the clusters are included in the legend. In (b), the user-selected cluster, with blue color and id=0, is clearly distinguishable from the remaining clusters, shown with orange color, of the dataset.
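The share of coverage reported in the legend can be reproduced from the cluster labels alone. The following is a minimal sketch, assuming the labels are available as a plain list of integers; the function name cluster_shares and the sample labels are hypothetical and are not part of ChemPlot.

```python
from collections import Counter


def cluster_shares(labels):
    """Return {cluster_id: percentage of the dataset}, ordered by cluster id.

    Hypothetical helper for illustration; not a ChemPlot function.
    """
    counts = Counter(labels)
    total = len(labels)
    return {cid: 100.0 * counts[cid] / total for cid in sorted(counts)}


# Hypothetical cluster labels for ten molecules.
labels = [0, 0, 1, 2, 2, 2, 1, 0, 0, 2]
shares = cluster_shares(labels)
for cid, pct in shares.items():
    print(f"cluster {cid}: {pct:.0f}%")
```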

Additional Features
ChemPlot contains stochastic processes (t-SNE and UMAP) that generate different visualizations for each run. In order to create reproducible plots, users can set the random_state parameter as shown below.
cp.umap(random_state=0)
Using ChemPlot, users can identify and remove the outlier molecules. This option is controlled by a boolean parameter, which is by default set to "False". To remove the outliers in the data, users can set this parameter to "True" as shown below.

cp.visualize_plot(remove_outliers=True)
To speed up the plotting process for large datasets, ChemPlot comes with a PCA pre-reduction option that reduces the number of dimensions to 10 before applying the t-SNE or UMAP dimensionality reduction methods. Users can set the parameter to "True" to include the PCA pre-reduction step.
The plot size is passed through the size parameter of the plotting functions.
cp.visualize_plot(size=20)
cp.interactive_plot(size=700)
In ChemPlot, the default plotting type is Scatter. In addition, ChemPlot provides Kernel Density Estimation (KDE) and Hexagonal plotting options. Figure 5 shows the visualization of the same dataset by the three plotting types. Users can set the plotting type as shown below.

[...] multi-dimensions to 2D. We separately executed the reduction phase for the three different algorithms of PCA, t-SNE, and UMAP. The construction phase of the structural similarity is linearly correlated with the size of the dataset and it is about three to six times faster than the tailored similarity. For the largest dataset shown in Table 1, the construction time is approximately 74 and 453 s for structural and tailored similarities, respectively. In the reduction phase, the elapsed time for PCA is linearly correlated with the size of the dataset, whereas it is polynomially correlated for UMAP and t-SNE. For the largest dataset, the reduction time using structural similarity was approximately 19, 180, and 429 s for PCA, UMAP, and t-SNE, respectively.

Distance Property Relationship Comparison
In this set of experiments, we compare the performance of various configurations of chemical space visualization in dealing with the APC problem. While traditional methods like Structure-Activity Relationships (SAR) 28 and Structure-Activity Landscape (SAL) 29 indexes provide quantification and detection of APCs of molecular datasets, there is no available method for quantifying APCs for chemical space visualizations. To measure the activity/property difference of similar (i.e. close) molecules in the visualized chemical space, we use the DPR score introduced herein. Unlike SAR and SAL indexes, the DPR score defines the similarity based on the distances of molecules in the reduced chemical space rather than on their common chemical substructures. To achieve this, the DPR algorithm first calculates the Euclidean distances between all pairs of molecules and sorts them in ascending order (i.e. closest first). Next, for a selected top percentage, an average of the property difference is calculated. Table 2 shows the top 1, 2, and 5% DPR scores of the chemical space visualizations that are obtained by using various experimental configurations. On all datasets, among the three dimensionality reduction methods, the t-SNE method produced the best DPR scores, while UMAP performed better than PCA. In addition, for the majority of experiments, the tailored similarity showed a better performance than the structural similarity.

Web Application
The ChemPlot web application provides users with an easy-to-use application programming interface (API) for visualizing chemical spaces. Users can upload their molecular datasets and choose the available visualization options from the left panel.
The API supports main functionalities of ChemPlot, including the similarity methods, dimensionality reduction methods, and plotting types. Additionally, it can be used to remove outliers and set a random state for reproducible plots. It generates interactive plots and allows users to export their visualized chemical spaces as an image or HTML file. Figure 5 shows an example view of the web application, which is openly accessible at https://www.amdlab.nl/chemplot/.

DISCUSSION
In this study, we introduced ChemPlot, a Python library for chemical space visualization. When designing ChemPlot, we aimed to provide a straightforward library where coding would be smooth and intuitive. This was achieved by providing a simple code flow and naming, handling the function parameters automatically, and preparing a descriptive user manual. ChemPlot allows users to convert molecular data files into chemical space visualizations simply by using a few lines of code and without setting any input parameters. Moreover, we developed a web application to streamline the visualization of chemical spaces without the need of a coding background and essentially for daily use. Thus, the ChemPlot Python library alongside its user-friendly web application is designed to serve a broad community of users.
Table 2. Top 1, 2, and 5% DPR scores of ChemPlot for the various experimental configurations applied on the sample datasets. S and T denote the structural and tailored similarity methods, respectively. [Table body not recovered; columns: PCA, t-SNE, and UMAP, each at the top 1%, 2%, and 5%, with S and T sub-columns; rows: datasets, beginning with FreeSolv.]
A noteworthy feature of ChemPlot is the application of tailored similarity for the visualization of chemical spaces. Before discussing this similarity feature, it is important to understand the similarity concept. Similarity is relative and it depends on the target and the population; therefore, there is no globally applicable definition of similarity 11 . The traditionally used form is the structural similarity that attempts to evaluate the similarity using all possible aspects of the given molecular datasets and without taking the target property into account. This approach is well suited when one wants to have a general overview of the dataset 30 . However, structural similarity often fails when searching for new molecules with target properties. For example, when searching for molecules with a desired aqueous solubility value, one can expect that the neighboring molecules on the visualized chemical space would have close solubility values. Neighboring molecules, however, are more likely to have closer solubility values when, instead of the similarity defined by all the features, the similarity is defined essentially by the features that are known to affect the solubility. In relation to this, ChemPlot includes the tailored similarity option, which automatically identifies the features correlated with the given target and visualizes the chemical space based on them. The chemical spaces that are visualized by using the tailored similarity are less likely to be affected by the APC issue. In line with this, for the majority of the datasets shown in Table 2, the tailored similarity produced lower DPR scores than the structural similarity.
In addition to its use in high-throughput virtual screening, diversity analysis, and virtual library design, it is also possible to use ChemPlot for visualizing the applicability domain of AI models. Here, the applicability domain refers to the region of chemical space where a model is expected to perform reliably and make confident predictions. For instance, when the predicted molecule is within the applicability domain, confidence can be put in the prediction, and when it is outside the applicability domain the prediction can be considered as dubious. By visualizing the chemical space of the training and test datasets of the models, the level of confidence of the prediction for new molecules can be determined based on their chemical space placement.
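One simple way to make the placement-based reasoning concrete is a nearest-neighbour distance check in the reduced 2D space. The sketch below is an illustration only: the distance criterion, the threshold value, and the function names are assumptions introduced here and do not describe a ChemPlot feature.

```python
import math


def nearest_train_distance(point, train_points):
    """Euclidean distance from a point to its closest training point."""
    return min(math.dist(point, t) for t in train_points)


def inside_domain(point, train_points, threshold):
    """Treat a point as inside the applicability domain when a training
    point lies within `threshold` (a hypothetical, user-chosen cutoff)."""
    return nearest_train_distance(point, train_points) <= threshold


# Hypothetical 2D coordinates of training molecules after reduction.
train_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

near = inside_domain((0.5, 0.5), train_xy, threshold=1.0)
far = inside_domain((5.0, 5.0), train_xy, threshold=1.0)
print(near, far)
```

Distances in t-SNE and UMAP embeddings are not guaranteed to preserve the distances of the original space, so such a check is best read as a visual aid.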
Additionally, ChemPlot can also be used to verify whether the dataset is properly split into train and test sets that are both adequately covering the chemical space 31,32 . Furthermore, by employing the clustering feature of ChemPlot, a test set region that is not properly covered by the training set can be intentionally put aside in order to evaluate the extrapolation capability of the model. Figure 4 shows an example use case of ChemPlot for identifying and isolating a test set that lies outside of the domain of the training set. This way, the performance of an ML model, which has been trained with the data from the orange region, on the data from the blue region can be used as an indicator of the extrapolation capability of the model. Accordingly, we expect that ChemPlot will serve as a complementary tool to the ML models that are increasingly applied in molecular informatics studies.
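The cluster-based hold-out described above can be sketched in a few lines once each molecule has a cluster label. The example assumes that the labels are available as a list aligned with the SMILES list; the helper name split_by_cluster and the sample data are hypothetical.

```python
def split_by_cluster(items, labels, test_clusters):
    """Split items into (train, test) lists.

    Items whose cluster id is in `test_clusters` are held out as the test
    set; all others form the training set. Hypothetical helper, not a
    ChemPlot function.
    """
    held_out = set(test_clusters)
    train = [x for x, c in zip(items, labels) if c not in held_out]
    test = [x for x, c in zip(items, labels) if c in held_out]
    return train, test


# Hypothetical molecules and cluster ids (cluster 0 plays the "blue" region).
smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "c1ccncc1"]
labels = [1, 1, 0, 1, 0]

train, test = split_by_cluster(smiles, labels, test_clusters=[0])
print(train)
print(test)
```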
Advancing research software sustainability is a dire challenge. Software released as desktop or web applications can suffer from a lack of maintenance and can therefore, over time, become inaccessible or completely unusable. We released ChemPlot as an open-source library on GitHub, thereby enabling the community to contribute to its future development. Code developers may contribute, for example, by adding new features pertaining to dimensionality reduction techniques or similarity methods.
Additionally, they may resolve user-specific technical issues when encountered. Moreover, an automated unit test workflow is implemented in ChemPlot to validate the integrity of the features developed by future contributors (see Methods). Furthermore, we developed a distribution workflow that automatically builds and distributes the new ChemPlot package when a new version is released. To pave the way for long-term software quality, we share the responsibility for the maintenance and development of ChemPlot. Thus, it can dynamically evolve in parallel with the future needs of the community.
The most significant practical limitation of ChemPlot is the computation time that will be required for excessively large datasets. In our performance tests, we encountered bottlenecks in both the construction and the reduction phases. In the construction phase, the calculation time of fingerprints and descriptors depends on both the total number of molecules and the total number of atoms in molecules. Also, the tailored similarity feature requires an additional step of feature selection.
ChemPlot uses the least absolute shrinkage and selection operator (lasso) regression analysis method and the logistic regression algorithms for feature selection. Depending on the algorithm used in this step, additional computation time is required.
Therefore, the construction phase takes more time for the tailored similarity. To accelerate this process, in the future versions of ChemPlot, the selection algorithms for instance can be optimized or replaced entirely by faster methods for the large datasets.
In the reduction phase, the production of all the plots took less than three minutes for both PCA and UMAP methods. However, for t-SNE, the production of the plots of large datasets took significantly longer. For instance, for the largest dataset we tested (41,127 instances), it took approximately seven minutes to generate the plot by t-SNE (see Table 1). We used the Barnes-Hut 33 implementation from the scikit-learn 34 library, which runs in O(N log N), where N is the number of instances. The reduction time is naturally also affected by the number of dimensions that will be reduced. Therefore, it is usually recommended to apply t-SNE after the dimensions are reduced by PCA to a fixed value 33,35 , although there is the risk of information loss.
ChemPlot includes a PCA pre-reduction option but the performance tests that are shown in Table 1 were conducted without this pre-reduction step. Therefore, for t-SNE, the total time consumed is largely penalized by the number of dimensions as well. To increase the processing rate of t-SNE, it may be worthwhile to implement its recently proposed variants [35][36][37][38] in ChemPlot in the future.

METHODS

Structural Similarity
When computing the chemical space visualization based on the structural similarity, the list of molecules is converted into Extended-Connectivity Fingerprints (ECFPs) 39 . ECFPs are bit-vectors where each bit represents the presence or absence of a particular substructure.
Substructures are extracted from the main structure by starting from each non-hydrogen atom and extending to the neighbor atoms until a specified distance is reached. Extracted substructures are hashed and mapped into a fixed-size bit-vector. ChemPlot uses the RDKit 1 library to convert SMILES and InChI notations into ECFPs with a bit-vector length of 2,048 bits and a radius of 2 adjacent atoms. After each molecule is converted, the bits that contain only 0s or only 1s across all molecules are removed from the bit-vectors. The remaining number of bits determines the total number of dimensions and they are used as the input for the dimensionality reduction phase.
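The removal of uninformative bits can be illustrated on a toy example. The sketch below assumes that the fingerprints are already available as equal-length lists of 0/1 values (8 bits here instead of 2,048); the function name drop_constant_bits is hypothetical and the code does not reproduce ChemPlot's internal implementation.

```python
def drop_constant_bits(fingerprints):
    """Remove bit positions that are 0 for every molecule or 1 for every molecule.

    Returns the reduced fingerprints and the indices of the kept positions.
    Hypothetical helper for illustration only.
    """
    n_bits = len(fingerprints[0])
    # A position is informative only if at least two distinct values occur in it.
    keep = [j for j in range(n_bits) if len({fp[j] for fp in fingerprints}) > 1]
    reduced = [[fp[j] for j in keep] for fp in fingerprints]
    return reduced, keep


# Toy fingerprints for three molecules.
fps = [
    [1, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 1],
]

reduced, kept = drop_constant_bits(fps)
print(kept)
print(reduced)
```

Bits that never vary carry no information for separating molecules, so removing them lowers the input dimensionality without changing pairwise differences.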

Tailored Similarity
When computing the chemical space visualization based on the tailored similarity, the list of molecules is converted into a set of descriptors as computed by using the Mordred library 3

Principal Component Analysis
PCA 4 is a linear dimensionality reduction algorithm, which projects the data points onto principal components by maximizing the variance.
ChemPlot applies PCA from the scikit-learn 34 library with the default parameters. The two most significant principal components are used as the reduced dimensions (axes of the graph) in the visualization step.

t-distributed Stochastic Neighbor Embedding
t-SNE 7 is a non-linear dimensionality reduction algorithm that converts the similarities between data points into joint probabilities. It then minimizes the difference between the joint probability distributions of the high-dimensional data and the low-dimensional embedding. It is a stochastic process that produces different results from different initialization parameters. Except for the perplexity parameter, ChemPlot applies t-SNE from the scikit-learn 34 library using its default parameters. The perplexity parameter is computed automatically by the pre-trained model as described below.

Uniform Manifold Approximation and Projection
UMAP 10 is a non-linear dimensionality reduction algorithm that constructs a particular weighted k-neighbor graph for the given data points and then computes a low dimensional layout of the graph. It is based on a stochastic process that produces different results from different initialization parameters. ChemPlot applies UMAP with the default parameters provided by the UMAP library, except for n_neighbors and min_dist, which are computed automatically by the pre-trained model described below.

Distance Property Relationship Score
The DPR score is a new method we developed to quantify the activity/property difference of similar molecules. It calculates the average activity/property difference of the most similar molecule pairs for a given percentage. An example pseudo-code is given below.
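The steps of the DPR calculation can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the property difference is taken as an absolute difference, the number of selected pairs is rounded up with a minimum of one pair, and the function name dpr_score and the sample data are hypothetical rather than taken from ChemPlot.

```python
import math
from itertools import combinations


def dpr_score(coords, properties, top_percent):
    """Average property difference of the closest `top_percent` of molecule pairs.

    coords: reduced-space coordinates, one tuple per molecule.
    properties: numerical target values aligned with coords.
    Assumptions (not confirmed by the source): absolute differences,
    ceiling rounding of the pair count, and at least one pair selected.
    """
    pairs = []
    for i, j in combinations(range(len(coords)), 2):
        distance = math.dist(coords[i], coords[j])
        pairs.append((distance, abs(properties[i] - properties[j])))
    # Sort pairs by distance in ascending order (closest first).
    pairs.sort(key=lambda pair: pair[0])
    n_top = max(1, math.ceil(len(pairs) * top_percent / 100))
    return sum(diff for _, diff in pairs[:n_top]) / n_top


# Hypothetical 2D coordinates and property values for four molecules.
coords = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
props = [1.0, 1.5, 4.0, 8.0]

print(dpr_score(coords, props, 1))
print(dpr_score(coords, props, 50))
```

Under this definition, a lower score means that molecules placed close together in the plot tend to have similar property values.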

Automatic Parameter Assignment
The most important parameters that affect the visualization are perplexity for t-SNE and number of neighbors for UMAP. However, it is not possible to set fixed values of these parameters that would work best for all datasets. In order to determine the best values for a given dataset, we developed a linear model. For this purpose, we used 20 molecular datasets of varying sizes and generated multiple plots for each dataset by assigning different values to the target parameters. The best plots for human perception and the respective values were selected by two users for each dataset based on the clearness and the balanced distribution of the clusters and data points. The selected values were used as the labels to form linear equations between the data size and the target parameters. By using the trained model, ChemPlot automatically assigns the parameters to generate visually meaningful plots.
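The idea of relating dataset size to a parameter value through a linear equation can be sketched with an ordinary least-squares fit. All numbers below are invented for illustration and are not the values selected for ChemPlot; the source also does not state whether the dataset size enters the equation directly or after a transformation, so the direct form used here is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_parameter(n_molecules, slope, intercept, lower=2):
    """Predict an integer parameter value, bounded below by `lower` (hypothetical bound)."""
    return max(lower, round(slope * n_molecules + intercept))


# Invented (dataset size, hand-picked perplexity) pairs, for illustration only.
sizes = [100, 500, 1000, 5000]
chosen_perplexity = [7, 15, 25, 105]

slope, intercept = fit_line(sizes, chosen_perplexity)
print(predict_parameter(2000, slope, intercept))
```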

Clustering
For clustering of the molecules in a reduced chemical space, ChemPlot uses the k-means 40 method as implemented in the scikit-learn 34 library. The default value for the number of clusters is "5", while all other parameters are taken as is from scikit-learn. Once clustering of the data in the reduced dimensions is complete, it is possible to highlight a selected cluster by passing a list of cluster ids when calling the visualize_plot function.

Static Plot
ChemPlot uses the Seaborn 41 library for generating the static plots. Static plots can be created by using one of three different visualization options, including the scatter, kernel density, and hexagonal bin plots. Static plots can be exported in PNG, JPEG, PDF, and SVG file formats.

Interactive Plot
ChemPlot uses the Bokeh 42 library for generating the interactive plots. The interactive plot allows users to interact actively with the visualization by providing drag, highlight, zoom, save, and reset functions. Hovering over the data points displays a tool-tip that contains a basic image of the molecule as rendered by RDKit 1 . The interactive plot can be exported in HTML format.

Removing Outliers
ChemPlot identifies molecules as outliers when their |z-score| ≥ 3. Z-scores are calculated by using the SciPy 43 library. The molecules that are identified as outliers can be removed from the plots by setting the remove_outliers parameter to "True".
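The z-score criterion can be illustrated with the standard library alone. The sketch assumes that z-scores are computed separately for each axis of the reduced space, using the population standard deviation, and that a molecule is flagged when any of its coordinates reaches the threshold; these choices and the function names are assumptions made for illustration, not a description of ChemPlot's code.

```python
from statistics import mean, pstdev


def zscores(values):
    """Z-scores using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread, so no value can be an outlier.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def outlier_mask(coords, threshold=3.0):
    """Flag points whose |z-score| >= threshold on any axis (assumed rule)."""
    z_by_axis = [zscores(axis) for axis in zip(*coords)]
    return [
        any(abs(z[i]) >= threshold for z in z_by_axis)
        for i in range(len(coords))
    ]


# Nineteen hypothetical points near the origin plus one far-away point.
coords = [(float(i % 5), float(i % 3)) for i in range(19)] + [(100.0, 1.0)]

mask = outlier_mask(coords)
kept = [p for p, is_outlier in zip(coords, mask) if not is_outlier]
print(len(kept))
```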

Unit Tests
ChemPlot uses the unittest Python framework for automated unit testing. Unit tests cover all construction, dimensionality reduction, and visualization methods. For all test cases, the related explanatory text is also provided. Unit tests are executed on three types of datasets: i) a dataset with a numerical target, ii) a dataset with a categorical target, and iii) a dataset that contains erroneous SMILES representations.

Visualization Tests
Visualization tests only cover the static plots. The test generates plots for the selected datasets with all possible combinations of parameters and puts them into a single PDF file for human inspection. It does not contain any automated validation.

Performance Tests
Performance tests measure the elapsed time for the construction and the dimension reduction phases. The test is executed for nine molecular datasets of different sizes and by using all the similarity and dimension reduction methods. Each performance test provides two output files, where one includes a table of execution times in seconds and the other contains meta information of the test environment.

Web Application
The ChemPlot web application is designed as independent open-source software that imports ChemPlot as a library. It uses the Streamlit library for creating the web interface.