Chemoinformatic Characterization of Synthetic Screening Libraries Focused on Epigenetic Targets

The importance of epigenetic drug and probe discovery is on the rise. Such discovery is paramount not only to identify and develop therapeutic treatments associated with epigenetic processes but also to understand the underlying epigenetic mechanisms involved in biological processes. To this end, chemical vendors have been developing synthetic compound libraries focused on epigenetic targets to increase the probability of identifying promising starting points for drug or probe candidates. However, the chemical contents of these data sets, the distribution of their physicochemical properties, and their diversity remain unknown. To fill this gap and make this information available to the scientific community, we report a comprehensive analysis of eleven libraries focused on epigenetic targets containing more than 50,000 compounds. We used well-validated chemoinformatics approaches to characterize these sets, including novel methods such as automated detection of analog series and visual representations of the chemical space based on Constellation Plots and Extended Chemical Space Networks. This work will guide the efforts of experimental groups working on high-throughput and medium-throughput screening of epigenetic-focused libraries. The outcome of this work can also be used as a reference to design and describe novel focused epigenetic libraries.


Introduction
Epigenetic drug and probe discovery continues to be relevant for a plethora of potential therapeutic applications, including cancer therapy. [1,2] Probes for epigenetic targets would be key to understanding epigenome data, which can be the basis for developing personalized medicine. [3] Similarly, chemical probes targeting epigenetic processes are fundamental in basic chemical biology associated with epigenetic processes. Computational and experimental screening efforts of chemical libraries have been ongoing to identify potential hits for various epigenetic targets. In addition to the traditional general screening libraries available for high-throughput screening, chemical companies have been assembling libraries focused on epigenetic targets, [4] and the chemical structures are available in the public domain.
Indeed, in epigenetic drug discovery, there are recent and increased efforts to design and analyze focused libraries. [5,6] By design, epigenetic-focused libraries have the potential to increase the epigenetic relevant chemical space, which has recently been revised. [7] Depending on the library developer, the compounds in the compound data sets are selected following a multi-step procedure, typically through computational approaches (although the specific methodological details remain undisclosed to the public). Many providers emphasize the diversity of the libraries, yet it remains hidden to the user what their contents and coverage of the chemical space are.
We propose that a content analysis of a focused library would be informative before screening it for hits.
To this end, validated chemoinformatic approaches provide a systematic and rigorous way to characterize the contents, chemical diversity, and, in general, coverage of compound libraries in chemical space. [8] The goal of this study is to rigorously characterize the chemical content, diversity, and drug-like properties of eleven epigenetic-focused libraries containing more than 50,000 compounds in total. All data sets are available in the public domain. To the best of our knowledge, this is the first systematic chemoinformatic analysis of epigenetic-focused compound libraries. As part of the analysis, we implemented a recently introduced and validated methodology to compute and identify core structures and analog series based on retrosynthetic rules. The analog series were the basis to generate Constellation Plots as a rich representation of the chemical space that combines substructure and fingerprint representations. For each epigenetic-focused data set, we identified the most frequent core structures. We also generated novel representations of the chemical space using the concept of Extended Chemical Space Networks (eCSNs) [9] and rapidly identified the most representative individual molecules (i.e., medoids) in a large data set: these chemical structures can be used as "chemical structural markers" or "chemical diagnostic molecules." [10] Herein, we discuss the most representative chemical structures of the epigenetic libraries, which can be used as a criterion for comparing the chemical libraries and prioritizing them for follow-up studies, including computational and experimental screening.

Methods
To conduct the chemoinformatic characterization of the 11 epigenetic-focused data sets, we calculated properties of pharmaceutical interest and generated molecular scaffolds using both the classic and most common method of Bemis and Murcko [11] and a novel approach to identify core structures and analog series automatically. [12,13] We also quantified the chemical diversity, using whole-molecule structural fingerprints, and the synthetic accessibility. To explore the distribution of the chemical libraries in chemical space, we generated two complementary and recently introduced visual representations of the chemical space, namely, Constellation Plots [14] and eCSNs. [9] The details of the data sets, their preparation, and the calculation of the properties, scaffolds, and fingerprints are described in the following sections.

Data sets preparation
The structure files of the focused libraries were obtained from different chemical vendors. The 11 chemical libraries are summarized in Table 1, which briefly describes each set and gives the number of compounds before and after data curation. In total, 53,443 compounds remained after data curation. The chemical structures were curated using the open-source cheminformatics toolkit RDKit, version 2021.03.2 (www.rdkit.org). Data curation was performed using an established protocol. [15] Briefly, compounds with valence errors or any chemical element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I were deleted. Stereochemistry information was removed because it is not defined for all compounds in the data sets. Compounds with multiple components were split, and the largest component was retained. The remaining compounds were neutralized and reionized, and a canonical tautomer was subsequently generated. Duplicated compounds were deleted. <Table 1 here >
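The curation steps described above can be sketched with the standardization module of RDKit. This is only a minimal illustration and not the established protocol of ref. [15]: the function names `curate_smiles` and `curate_library`, the use of SMILES strings as input, and the choice of `rdMolStandardize` classes for each step are assumptions made for the example.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Elements retained by the curation protocol described in the text.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}


def curate_smiles(smiles):
    """Return a curated canonical SMILES, or None if the compound is discarded.

    Hypothetical helper: the mapping of each step to an RDKit call is an assumption.
    """
    # MolFromSmiles returns None when sanitization fails (e.g., valence errors).
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Discard compounds containing any element outside the allowed set.
    if any(atom.GetSymbol() not in ALLOWED_ELEMENTS for atom in mol.GetAtoms()):
        return None
    # Drop stereochemistry, since it is not defined for every compound.
    Chem.RemoveStereochemistry(mol)
    # Keep only the largest component of multi-component records.
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize, reionize, and select a canonical tautomer.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    mol = rdMolStandardize.Reionizer().reionize(mol)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)


def curate_library(smiles_list):
    """Curate every record and remove duplicates, keeping first-seen order."""
    seen, curated = set(), []
    for smiles in smiles_list:
        result = curate_smiles(smiles)
        if result is not None and result not in seen:
            seen.add(result)
            curated.append(result)
    return curated
```

Following the order given in the text, the element filter is applied before the components are split, so a record with a disallowed counterion is discarded rather than desalted in this sketch.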

Properties of pharmaceutical interest
For each chemical structure of the 11 curated libraries, six properties of pharmaceutical interest were computed with RDKit, version 2021.03.2: molecular weight (MW), number of hydrogen bond acceptors and donors (HBA and HBD, respectively), number of rotatable bonds (RB), topological polar surface area (TPSA), and octanol/water partition coefficient (SlogP). These properties are associated with compound size (MW), polarity (HBA, HBD, TPSA, SlogP), and flexibility (RB). The six properties are the basis of the well-known empirical rules of Lipinski [16] and Veber [17] (i.e., MW≤500, HBD≤5, HBA≤10, SlogP≤5, TPSA≤140, RB≤10) that help to assess whether a compound is likely to be orally absorbed (the preferred route of drug administration in most cases). The properties are not associated directly with biological activity but are commonly used to profile compound screening libraries in drug discovery projects.
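As an illustration, the six properties and the combined Lipinski and Veber thresholds listed above could be evaluated as follows. The text does not state which RDKit functions were used, so the descriptor calls below are an assumption, and `pharma_properties`, `within_lipinski_veber`, and `LIMITS` are hypothetical names introduced for this sketch.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors

# Upper limits quoted in the text for the Lipinski and Veber rules.
LIMITS = {"MW": 500, "HBD": 5, "HBA": 10, "SlogP": 5, "TPSA": 140, "RB": 10}


def pharma_properties(mol):
    """Compute the six properties for one RDKit molecule (assumed descriptor choices)."""
    return {
        "MW": Descriptors.MolWt(mol),              # molecular weight
        "HBA": Lipinski.NumHAcceptors(mol),        # hydrogen bond acceptors
        "HBD": Lipinski.NumHDonors(mol),           # hydrogen bond donors
        "RB": Lipinski.NumRotatableBonds(mol),     # rotatable bonds
        "TPSA": rdMolDescriptors.CalcTPSA(mol),    # topological polar surface area
        "SlogP": Crippen.MolLogP(mol),             # Crippen octanol/water logP
    }


def within_lipinski_veber(props):
    """Return True if every property is at or below its threshold."""
    return all(props[name] <= limit for name, limit in LIMITS.items())


aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
aspirin_props = pharma_properties(aspirin)
print(aspirin_props, within_lipinski_veber(aspirin_props))
```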

Scaffolds: core structures and analog series
We used two approaches to systematically generate and analyze molecular scaffolds, exemplified in https://github.com/navejaromero/analog-series. [12,13] < Figure 1 here > To compute the Bemis-Murcko scaffolds, [11] the side chains are removed, as illustrated in Figure 1.
More specifically, based on the graph representation of the chemical structures, all vertices of degree one are iteratively removed. The concept of core scaffolds and analog series is also illustrated in Figure 1 and discussed in detail elsewhere. [12,13] Briefly, core scaffolds and analog series are obtained by applying a series of fragmentation rules based on retrosynthetic feasibility. Two molecules are considered analogs if the fragmentation rules can map them to the same core fragment, and that fragment is a significant part of the molecule (usually, it contains at least two-thirds of the total number of heavy atoms of each molecule).
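The graph view of side-chain removal can be sketched in plain Python. The example below treats a molecule as an undirected graph and repeatedly deletes vertices of degree one until only ring systems and linkers remain. The adjacency-dictionary input and the name `prune_side_chains` are assumptions made for illustration; toolkit implementations of Bemis-Murcko scaffolds may differ in details such as the handling of exocyclic double bonds.

```python
def prune_side_chains(adjacency):
    """Iteratively remove vertices of degree one (or zero) from an undirected graph.

    `adjacency` maps each atom index to the set of its bonded neighbors.
    Returns the adjacency of what remains: ring systems and the linkers between them.
    """
    graph = {atom: set(neighbors) for atom, neighbors in adjacency.items()}
    leaves = [atom for atom, neighbors in graph.items() if len(neighbors) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in graph:
            continue  # already removed through an earlier queue entry
        for neighbor in graph.pop(atom):
            graph[neighbor].discard(atom)
            # A neighbor left with at most one bond becomes a new terminal atom.
            if len(graph[neighbor]) <= 1:
                leaves.append(neighbor)
    return graph


# Ethylbenzene: atoms 0 and 1 form the ethyl chain, atoms 2-7 the benzene ring.
ethylbenzene = {
    0: {1}, 1: {0, 2},
    2: {1, 3, 7}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5, 7}, 7: {6, 2},
}
print(sorted(prune_side_chains(ethylbenzene)))  # ring atoms only
```

Acyclic molecules are pruned completely in this sketch, which mirrors the fact that they have no ring-based scaffold.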

Structural diversity
The structural diversity of each compound library was computed using two distinct representations: scaffolds and fingerprints. As discussed elsewhere, [19] the two representations provide complementary information, leading to a more comprehensive assessment of the structural diversity of compound data sets.

Scaffold-based with analog series analysis
The scaffold diversity was also quantified based on the number of cores and analog series, and their ratio compared with the total number of molecules in the dataset. We have previously used the number of identifiable cores and analog series as a measure of chemical diversity in a dataset. [12] However, the 11 data sets presented here are too variable in their size. Therefore, we propose two new indicators: the fraction of molecules in the constellation plot (only analog series with at least three compounds are included there) and the average size of the analog series, measured as the number of unique compounds in the dataset divided by the number of analog series represented in it.
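The two indicators can be written down compactly. The sketch below assumes that the analog series are available as a mapping from a series identifier to the set of compound identifiers it contains, and that every series (including those with fewer than three compounds) counts toward the number of series; both the data layout and the name `analog_series_indicators` are hypothetical.

```python
def analog_series_indicators(series, n_compounds, min_size=3):
    """Return (fraction of molecules in the constellation plot, average series size).

    series:      dict mapping a series id to the set of compound ids it contains
    n_compounds: number of unique compounds in the data set
    min_size:    smallest series shown in the constellation plot
    """
    in_plot = set()
    for members in series.values():
        if len(members) >= min_size:
            in_plot.update(members)  # a compound is counted once even if shared
    fraction_in_plot = len(in_plot) / n_compounds
    average_series_size = n_compounds / len(series)
    return fraction_in_plot, average_series_size


toy_series = {"AS1": {1, 2, 3}, "AS2": {4, 5}, "AS3": {6}}
print(analog_series_indicators(toy_series, n_compounds=6))
```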

Fingerprint-based
The diversity of the compound libraries was computed with the extended similarity indices as recently reported. [20,21] Briefly, instead of only measuring the similarity between pairs of molecules, the extended (i.e., n-ary) indices allow us to calculate the similarity of any number of molecules simultaneously. [20] This has two key advantages when it comes to the analysis of molecular libraries. First, the n-ary indices provide a truly global description of the correlation between the molecules in the set (as exemplified by their superior performance in estimating the compactness of a set). [21] Second, the extended indices are dramatically more efficient, requiring only O(N) operations to calculate the similarity of N molecules (as opposed to the quadratic scaling of the standard binary similarity indices). Extended similarity has been successfully used in numerous applications, including compound diversity analysis, [21] comparison of nucleotide and protein sequences, [22] and, more recently, analysis of molecular dynamics simulations. [10] Herein, we used this approach as a novel and efficient manner to quantify and compare the fingerprint-based diversity of the 11 compound libraries. The extended similarity indices were computed with a fractional weight function, with various coincidence thresholds. The Python code to conduct the extended similarity indices calculations is freely available at https://github.com/ramirandaq/MultipleComparisons.
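To make the linear scaling concrete, the following sketch computes one form of an extended Tanimoto index from the per-bit column sums of a set of binary fingerprints. It is an illustration written for this text and not the code in the repository cited above: the function name, the default coincidence threshold (N mod 2), the specific fractional weight expressions, and the weighted denominator are assumptions about one common variant.

```python
def extended_tanimoto(fingerprints, threshold=None):
    """Extended (n-ary) Tanimoto similarity of N equal-length binary fingerprints.

    Assumed variant: fractional weights and a weighted denominator.
    """
    n = len(fingerprints)
    if n < 2:
        raise ValueError("at least two fingerprints are required")
    gamma = n % 2 if threshold is None else threshold  # coincidence threshold
    # Column sums: how many of the N fingerprints have each bit set.
    column_sums = [sum(column) for column in zip(*fingerprints)]
    weighted_one_sim = 0.0  # bits where "on" clearly dominates
    weighted_dissim = 0.0   # bits without a clear majority
    for k in column_sums:
        delta = 2 * k - n
        if delta > gamma:
            weighted_one_sim += delta / n
        elif -delta > gamma:
            continue  # bits where "off" dominates do not enter a Tanimoto-type index
        else:
            weighted_dissim += 1.0 - (abs(delta) - n % 2) / n
    denominator = weighted_one_sim + weighted_dissim
    return weighted_one_sim / denominator if denominator else 0.0


print(extended_tanimoto([[1, 1, 0, 0], [1, 0, 1, 0]]))
```

For two fingerprints this variant reduces to the familiar pairwise Tanimoto coefficient (one shared bit out of three set bits in the example). The cost grows with the number of fingerprints times the fingerprint length, because only the column sums are needed and no pairwise comparison is made.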

Chemical space: visual representation
Chemical space has been defined as the set of molecular descriptors in which molecules will be represented. Visual representation of the chemical space helps to better understand the mutual relationships between compounds in that multi-dimensional descriptor space. [8] Among the several methods available (the most common include principal component analysis and t-distributed stochastic neighbor embedding), herein we used two novel representations detailed in this section.

Constellation Plots
Constellation plots are useful to depict the chemical space of chemical libraries containing several analog series. [14] They concisely depict the structure-activity (property) relationships, SA(P)R, in a summarized representation of the data set, as analog series can be represented in fewer data points than individual compounds. [23,24] Of note, only a fraction of the total data is presented in the constellation plot, namely the compounds forming analog series; in this case, we included only analog series consisting of at least three compounds. Recently, constellation plots have been used to describe a library of antidiabetic natural products [25] and a collection of tubulin inhibitors. [26]

Extended Chemical Space Networks
The chemical space networks (CSNs), proposed and developed by Maggiora and Bajorath, start by measuring the pairwise similarity between the molecules in a data set (using a given similarity coefficient and a compound representation). [27,28] Then, the molecules are represented by nodes, which are connected if the similarity is larger than an established threshold. A limitation of this approach is that it is difficult to visualize networks for large compound data sets. Moreover, this approach also requires O(N²) operations, so it is not well-suited to represent large sections of the chemical space. To overcome this issue, the eCSNs have been recently proposed (vide supra). This is a natural generalization of the CSNs, in which any given molecular set can be taken as a node in the network (in this study, the nodes are the 11 libraries). Then, the relations between these nodes are established via the extended similarity calculated for the union of the corresponding libraries. This coarse-grained representation is markedly more efficient since it exploits the more favorable computational scaling of the n-ary indices.

Synthetic accessibility
The synthetic complexity of the compounds was estimated using the previously published synthetic accessibility (SA) score. [29] Briefly, the SA score implemented in this study is the difference between a fragment score and a complexity penalty. The fragment score captures common structural features in a large number of already synthesized molecules (934,046 representative molecules from PubChem).
Molecules are fragmented using extended connectivity fragments, and the fragment score is calculated as the sum of the contributions of all fragments in the molecule divided by the number of fragments in the molecule. The fragment frequency is related to synthetic accessibility, since easy-to-prepare substructures occur frequently in molecules. The complexity penalty is calculated as the sum of contributions from ring complexity (i.e., ring bridge atoms and spiro atoms), the number of stereocenters, large rings (i.e., ring size greater than eight), and molecule size. The SA score was calculated for all epigenetic-focused libraries.

Results and Discussion
Properties of pharmaceutical relevance
Figure S1 in the Supporting Information shows box plots and summary statistics of the distribution of the six calculated properties of pharmaceutical relevance. The profiling of the six properties indicated that, in general, all 11 compound libraries are within the Lipinski and Veber parameters. Based on this criterion, the libraries contain acceptable candidate compounds for drug discovery and development programs (in particular, for oral administration). All 11 compound data sets have a comparable distribution of the six properties, as shown in Figure S1. This outcome might be anticipated since it is likely that the chemical vendors (developers) filter or consider the so-called "drug-like" properties during the assembly of the focused libraries. Nevertheless, the profiling disclosed in this work is relevant and encourages the experimental screening of the 11 compound libraries for drug discovery projects.

Scaffold content: core structures and analog series
Analyzing the main core scaffold of chemical compounds is particularly relevant in the context of drug discovery because the central element drives the overall molecular shape, arranges the substituents in their specific positions, and takes part in the biological activity itself. [30] For this reason, systematic profiling of the scaffold content of synthetic organic compounds for drug discovery is of utmost relevance.
The main scaffold or core structure of large data sets can be generated automatically and consistently in several ways, as recently reviewed. [24] In general, it is desirable to generate the scaffolds rapidly, consistently, and in a way that is interpretable, in particular for an organic or medicinal chemist working on chemical synthesis. As detailed in the Methods section, in this work we implemented two methodologies, generating the Bemis-Murcko scaffolds and analog series based on core scaffolds (Figure 1). Concerning the core scaffolds and analog series, the most frequent are shown in Figure 2. The obtained cores overlap very little with the 201 substructures that have been annotated as epigenetic bioactive rings in a recent publication: of the 4016 cores matching at least three molecules from any epigenetic data set, only 19 contained at least one of the epigenetic rings. This highlights the structural novelty of the studied libraries [30] and their potential to expand the epigenetic relevant chemical space. < Figure 2 here >
After the compound screening, the most frequent scaffolds, core structures, and analog series are potentially privileged or enriched towards epigenetic targets.

Structural diversity
Based on scaffolds
Figure 3B shows the percentage of scaffolds with a frequency of at least two per library. Clearly, ChemDiv was the data set with the largest proportion of non-unique scaffolds, suggesting the lowest scaffold diversity.
< Figure 3 here > To further quantify scaffold diversity based on the Bemis-Murcko scaffolds, we used cyclic system recovery curves. As documented elsewhere, [31] based on the scaffolds count, the fraction of cyclic systems is plotted against the cumulative fraction of the database. A diagonal represents maximum scaffold diversity, i.e., each compound will have its chemical scaffold. A vertical line represents the minimum scaffold diversity (all compounds have the same scaffold). Figure 3C shows the cyclic system recovery curves for all data sets. The curves can be further characterized by the area under the curve, AUC (maximum diversity: AUC = 0.5; minimum diversity: AUC = 1.0). Table S1 in the Supporting Information summarizes the AUC values for the 11 data sets. The results show that ChemDiv is the least diverse, followed (AUC = 0.87) by OTAVA DNMT3b. In contrast, TocrisScreen and Targetmol are the most diverse (AUC <= 0.56).
The analog series analysis suggested that ChemDiv, Asinex, Life Chemicals, and OTAVA DNMT3b are the least diverse data sets, as they have a larger average number of compounds per series. All the other libraries seem to be more diverse, and it is hard to single out the least diverse from this perspective. <Table 2 here >
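The cyclic system recovery curves and their AUC, described earlier in this section, can be sketched as follows. The input is assumed to be one scaffold identifier (for example, a scaffold SMILES) per compound; the names `csr_curve` and `csr_auc` and the trapezoidal integration are choices made for this illustration rather than details taken from ref. [31].

```python
from collections import Counter


def csr_curve(scaffolds):
    """Return the (fraction of scaffolds, cumulative fraction of compounds) points.

    Scaffolds are ranked from most to least frequent, so the curve rises
    steeply when a few scaffolds account for most of the compounds.
    """
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), len(scaffolds)
    points = [(0.0, 0.0)]
    cumulative = 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        points.append((rank / n_scaffolds, cumulative / n_compounds))
    return points


def csr_auc(scaffolds):
    """Area under the recovery curve by the trapezoidal rule.

    Values near 0.5 indicate high scaffold diversity; values approaching 1.0
    indicate low diversity. The piecewise-linear curve is only informative
    for data sets with many distinct scaffolds.
    """
    points = csr_curve(scaffolds)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


print(csr_auc(["a", "b", "c", "d"]))            # every compound has its own scaffold
print(csr_auc(["a"] * 7 + ["b", "c", "d"]))     # one scaffold dominates
```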

Based on fingerprints
Results of the fingerprint analysis are shown in Figure 4. The figure shows the similarity of the databases computed with RDKit fingerprints and the extended Tanimoto similarity coefficient, at different coincidence thresholds. The analysis revealed that the least diverse set is ChemDiv followed by Asinex.
Notably, ChemDiv is the compound data set with the largest number of compounds (27,543), meaning that the largest data set is not necessarily the one with the greatest structural diversity, as clearly shown here. Similar results have been obtained for other data sets. [31] These results further emphasize the need to quantify the structural diversity. Similar conclusions regarding the relative diversity of the compound libraries were obtained with other extended similarity indices (in addition to Tanimoto) and MACCS keys, as shown in Figure S3 in the Supporting Information.

Constellation plots
We mapped all compounds into the same chemical space, regardless of the database they came from.
We were only interested in analog series having no less than three compounds, even if all of them belonged to different data sets. Afterwards, we highlighted the cores (points) represented in each database. Plots of representative epigenetic libraries are shown in Figure 5, while other libraries are depicted in Figure S4 of the Supporting Information.

Extended Chemical Space Networks
The eCSNs have been used to visualize the chemical space of 19 large data sets of organic compounds, including natural products, drugs approved for clinical use, and other compound libraries, with more than 18 million molecules. [9] As discussed in that paper, this novel representation of the chemical space based on molecular fingerprints is an efficient method to compare the structural relationships among compound libraries. Figure 6 shows a visual representation of the chemical space of the 11 compound libraries using RDKit fingerprints. The network shows that ChemDiv and Asinex (which happen to be the least diverse libraries based on RDKit fingerprints) are at the center of the representation, with several connections (similarities) to other databases. In particular, ChemDiv (identified with the number 2 in this network: ID, 2) has the largest number of connections, and these connections are closer (visualized by darker linkers between the nodes) with other libraries, such as ApeXBio (ID, 0), SelleckChem (ID, 8), and Targetmol (ID, 9). < Figure 6 here >
Table S2 in the Supporting Information. The medoids are calculated using the algorithm described recently. [10] In short, we calculate the complementary similarity of every molecule in a library, that is, the extended similarity of the set formed by all molecules except the selected one (which we can do in O(N) for the whole set). Then, we can rank all the molecules from more (i.e., medoid-like) to less (i.e., outlier-like) representative by simply ordering them according to the increasing value of their complementary similarity. The reasoning behind this is simple: by removing a molecule that is closely related to all of the rest, we leave behind a more 'disorganized' set, which will have a lower complementary similarity. In this context, the medoids could be interpreted as chemical structural markers or "signatures" of the compound libraries and help to profile the chemical contents of each data set. <Figure 7 here >
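The ranking by complementary similarity can be illustrated with a short, self-contained sketch. It reuses an extended Tanimoto index with fractional weights computed from column sums; the specific weight expressions, the threshold of N mod 2, and the function names are assumptions made for this example and are not taken from ref. [10].

```python
def extended_tanimoto_from_sums(column_sums, n):
    """Extended Tanimoto of n fingerprints, given only their per-bit column sums.

    Assumed variant: fractional weights, threshold n mod 2, weighted denominator.
    """
    gamma = n % 2
    one_sim = dissim = 0.0
    for k in column_sums:
        delta = 2 * k - n
        if delta > gamma:
            one_sim += delta / n                      # "on" bits clearly dominate
        elif -delta <= gamma:
            dissim += 1.0 - (abs(delta) - n % 2) / n  # no clear majority
    total = one_sim + dissim
    return one_sim / total if total else 0.0


def rank_by_representativeness(fingerprints):
    """Return (indices ordered from medoid-like to outlier-like, complementary similarities)."""
    n = len(fingerprints)
    if n < 3:
        raise ValueError("at least three fingerprints are required")
    # Column sums are computed once for the whole set.
    column_sums = [sum(column) for column in zip(*fingerprints)]
    complementary = []
    for fingerprint in fingerprints:
        # Removing one molecule only requires subtracting its bits from the sums.
        remaining = [k - bit for k, bit in zip(column_sums, fingerprint)]
        complementary.append(extended_tanimoto_from_sums(remaining, n - 1))
    order = sorted(range(n), key=complementary.__getitem__)
    return order, complementary


toy_fingerprints = [
    [1, 1, 1, 1, 0, 0],  # shares bits with every other member of the cluster
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],  # unrelated molecule
]
order, scores = rank_by_representativeness(toy_fingerprints)
print(order[0], order[-1])
```

In this toy set the first fingerprint has the lowest complementary similarity and is ranked as the medoid, whereas the unrelated last fingerprint has the highest and is ranked as the outlier.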

Synthetic accessibility
We also profiled the chemical libraries using a validated in silico approach to estimate the synthetic accessibility, as described in the Methods section. [29] Results presented in Figure S5 Table 1.

Conclusions
Herein we report the first comprehensive chemoinformatic analysis of 11 compound libraries focused on epigenetic targets commercially available for screening. Different vendors have previously selected the molecular libraries, but their profile of properties, scaffold contents, and structural diversity was unknown.
Profiling of the six properties of pharmaceutical relevance (MW, SlogP, HBD, HBA, TPSA, and RB) revealed that all 11 compound libraries are suitable to be screened in drug discovery campaigns to identify molecules that could eventually be orally administered. It was found that, other than benzene, N-phenylbenzenesulfonamide, 1H-indole, N-phenylbenzamide, and 1H-benzimidazole were the most prevalent. The results of the fingerprint-based diversity analysis indicated that SelleckChem is among the most diverse libraries. In contrast, ChemDiv and Asinex are the least diverse, relative to all other data sets.
Regarding the Bemis-Murcko scaffolds and analog series, ChemDiv was also the least diverse, while TocrisScreen, Targetmol, and SelleckChem were the most diverse. Taken together, based on the results of structural diversity, the most diverse library overall (TocrisScreen) should be prioritized for experimental medium-throughput screening. Interestingly, out of the 11 databases analyzed, TocrisScreen was the smallest data set (100 compounds), yet it is the most diverse. In sharp contrast, ChemDiv was the largest data set (27,543 compounds) but is the least structurally diverse. The scaffold content and analog series, analyzed in the context of rings present in currently known compounds with activity against epigenetic targets, revealed that the focused libraries have a large potential to expand the epigenetic relevant chemical space. Results of the calculated synthetic accessibility showed that all compound data sets are, in general, feasible to make. For practical applications, the libraries could first be acquired from the chemical vendors but, if needed, could be synthesized in-house. We anticipate that the results of the chemoinformatic characterization discussed in this work will assist research teams in the decision-making process and in prioritizing which libraries to move forward to experimental screening in, for example, a high-throughput screening setting.

Figure 1. General approaches to compute molecular scaffolds. Note that the Bemis-Murcko approach maps every molecule to only one scaffold. Small changes in the scaffold result in a failure to identify analogs. On the other hand, the core approach is in many instances (but not always) able to identify such analogs. The molecules shown are only a small subset of an analog series consisting of over 400 compounds (AS7684).