Biological and Medicinal Chemistry

Splitting chemical structure data sets for federated privacy-preserving machine learning



As machine learning methods find more applications in drug design and related fields, designing sound test sets becomes increasingly important. The goal is a realistic split of chemical structures (compounds) into training, validation and test sets such that performance on the test set is a meaningful indicator of performance in a prospective application. This challenge is interesting and relevant on its own, but it becomes more complex in federated machine learning, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods that split a data set and are applicable in a federated privacy-preserving setting: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). We evaluate these splitting methods against random splitting using the following quality criteria: bias in prediction performance, label and data imbalance, and distance of the test set compounds to the training set. The main findings are that (a) both sphere exclusion clustering and scaffold-based binning produce high-quality splits of the data sets, and (b) sphere exclusion clustering is computationally very expensive in the federated privacy-preserving setting.
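To illustrate why a hashing-based split can work without sharing structures, the following is a minimal sketch of bit-sampling LSH on binary fingerprints. It is not the implementation used in this work: the function name `lsh_fold`, the salt, the number of sampled bits, the fingerprint length of 1024 and the number of folds are all assumptions chosen for illustration. The idea is that each partner applies the same agreed bit positions and salt locally, so similar compounds tend to land in the same fold at every partner while no structure leaves its owner.

```python
import hashlib
import random


def lsh_fold(on_bits, bit_positions, n_folds=5, salt="shared-secret"):
    """Assign a fold from a bit-sampling LSH key of a binary fingerprint.

    on_bits: iterable of indices of the bits set in the fingerprint.
    bit_positions: sampled bit indices agreed on by all partners (assumption).
    salt: shared string so the mapping is identical across partners (assumption).
    """
    on_bits = set(on_bits)
    # LSH key: the fingerprint restricted to the sampled bit positions.
    key = "".join("1" if p in on_bits else "0" for p in bit_positions)
    # Hash the key so that buckets are spread evenly over the folds.
    digest = hashlib.sha256((salt + key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


# Hypothetical setup: 16 bit positions sampled from a 1024-bit fingerprint,
# generated from a seed that all partners share.
rng = random.Random(42)
positions = sorted(rng.sample(range(1024), 16))

# Two toy fingerprints that agree on every sampled bit share a fold.
fp_a = {positions[0], positions[3]}
unsampled_bit = next(i for i in range(1024) if i not in positions)
fp_b = fp_a | {unsampled_bit}
print(lsh_fold(fp_a, positions), lsh_fold(fp_b, positions))
```

Because the fold depends only on the local fingerprint and the shared parameters, the assignment is deterministic and needs no communication between partners. Sampling more bits gives finer buckets, which is a tuning choice rather than something prescribed here.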

Version notes

The introduction and methods sections have been revised for clarity.



Supplementary material

Figure S1
Supporting figure 1

Supplementary weblinks

Fold-splitting package for machine learning in medicinal chemistry developed as part of this work
Machine learning package for biochemical applications
Pipeline for data preparation developed as part of the MELLODDY project
Second version of MELLODDY tuner with integrated scaffold-based binning
Public dataset
Public dataset derived from ChEMBL used in this manuscript