Splitting chemical structure data sets for federated privacy-preserving machine learning

Jaak Simm; Lina Humbeck; Adam Zalewski; Noe Sturm; Wouter Heyndrickx; Yves Moreau; Bernd Beck; Ansgar Schuffenhauer

doi:10.26434/chemrxiv-2021-xd440-v2

Biological and Medicinal Chemistry

20 October 2021, Version 2

This is not the most recent version. There is a newer version of this content available.

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

As machine learning methods are applied more widely in drug design and related fields, the challenge of designing sound test sets becomes increasingly prominent. The goal is a realistic split of chemical structures (compounds) between training, validation and test sets, such that performance on the test set is a meaningful indicator of performance in a prospective application. This challenge is interesting and relevant in its own right, but it becomes even more complex in federated machine learning, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods that split the data set and are applicable in a federated privacy-preserving setting: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria: bias in prediction performance, label and data imbalance, and distance of the test set compounds to the training set; we compare each method against random splitting. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning produce high-quality splits of the data sets, and (b) sphere exclusion clustering is very expensive in terms of compute cost in the federated privacy-preserving setting.
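To make the idea of a split that needs no exchange of structures more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each partner reduces a compound locally to a string key (either an LSH bucket key built from a few fingerprint bit positions, or a scaffold string) and then maps that key to a fold with a keyed hash. The function names, the secret, the bit positions and the toy fingerprints are all hypothetical and chosen only for illustration.

```python
import hashlib
import hmac


def lsh_bucket_key(on_bits, selected_positions):
    """Build an LSH bucket key from the fingerprint bits at fixed positions.

    on_bits: set of indices of the bits set in a binary fingerprint.
    selected_positions: bit positions agreed on by all partners (assumption).
    """
    return "".join("1" if p in on_bits else "0" for p in selected_positions)


def assign_fold(key, secret, n_folds=5):
    """Map a bucket key or scaffold string to a fold index with a keyed hash."""
    digest = hmac.new(secret, key.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


# Hypothetical shared parameters; in this sketch every partner uses the same ones.
SECRET = b"shared-secret-agreed-by-partners"
POSITIONS = [3, 17, 42, 101]

# Toy fingerprints given as sets of on-bit indices (not real compounds).
fp_a = {3, 42, 500}
fp_b = {3, 42, 777}  # differs from fp_a only outside POSITIONS

fold_a = assign_fold(lsh_bucket_key(fp_a, POSITIONS), SECRET)
fold_b = assign_fold(lsh_bucket_key(fp_b, POSITIONS), SECRET)

# Scaffold-based binning can reuse the same mapping with a scaffold string as key.
fold_scaffold = assign_fold("c1ccccc1", SECRET)
```

Because the fold depends only on the locally computed key and the shared secret, two partners holding compounds with the same key place them in the same fold without revealing the structures to each other. Sphere exclusion clustering does not fit this pattern, since it relies on pairwise distances between compounds, which is consistent with the higher compute cost reported in the abstract for the federated setting.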

Keywords

fold split

scaffold

locality sensitive hashing

sphere-exclusion clustering

Supplementary materials

Figure S1: Supporting figure 1

Supplementary weblinks

ChemFold: fold splitting package for machine learning in medicinal chemistry, developed as part of this work

Sparsechem: machine learning package for biochemical applications

MELLODDY-TUNER v1.0: pipeline for data preparation developed as part of the MELLODDY project

MELLODDY-TUNER v2: second version of MELLODDY-TUNER with integrated scaffold-based binning

Public dataset: public dataset derived from ChEMBL used in this manuscript

Now Published

Splitting chemical structure data sets for federated privacy-preserving machine learning

Jaak Simm, Lina Humbeck, Adam Zalewski, Noe Sturm, Wouter Heyndrickx, Yves Moreau, Bernd Beck, Ansgar Schuffenhauer

Journal article: Journal of Cheminformatics, Volume 13, Issue 1

Online publication date: Dec 07, 2021

Version History

Nov 15, 2021 Version 3

Oct 20, 2021 Version 2

Jul 28, 2021 Version 1

Version Notes

The introduction and methods sections have been revised for clarity.

Metrics

Views: 1,971

Downloads: 787

License

The content is available under CC BY NC ND 4.0

Funding

Innovative Medicines Initiative

831472

Author’s competing interest statement

The authors declare that they have no competing interests. The authors AZ, NS, AS, WH, BB and LH did the work as employees of Amgen, Novartis, Janssen and Boehringer Ingelheim, respectively.

Ethics

The author(s) declare that they have sought and gained approval from the relevant ethics committee/IRB for this research and its publication.
