Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets

Giovanni A. Tricarico; Johan Hofmans; Eelke B. Lenselink; Miriam López-Ramos; Marie-Pierre Dréanic; Pieter F. W. Stouten

doi:10.26434/chemrxiv-2022-m8l33-v3

Biological and Medicinal Chemistry

29 March 2024, Version 3

Working Paper

This content is a preprint and has not undergone peer review at the time of posting.

Abstract

When preparing training, validation and test sets for machine learning on molecular datasets, it is desirable to satisfy two requirements: 1) robustness, i.e. the test set is chemically dissimilar from the training set; 2) data balance, i.e. the proportion of data points and the distribution of data labels (categorical) or data values (continuous) are as homogeneous as possible across the sets for each individual property to model, while the overall set of compounds is partitioned as required. Recent literature shows that meeting both requirements simultaneously can be very difficult. This is especially true for multi-task learning, but it also applies to single-task learning when one additionally aims to balance the distribution of data labels or values. In this work we present a method that resolves this issue by first carrying out a chemistry-guided clustering of the initial dataset to ensure the separation of chemical matter, and then applying linear programming to select the lists of clusters that, once assembled into the final sets, result in the best possible data balance.
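The cluster-selection step can be illustrated with a toy sketch. It assumes that clustering has already been done and that each cluster is summarised by the number of data points it contributes to each task. It is not the authors' script: the input array, the target fractions and the function names are hypothetical, and the linear-programming step is replaced by brute-force enumeration so that the example runs with the Python standard library only. The objective, the summed absolute deviation from the target counts per set and task, is one plausible way to express the data-point balance and is an assumption here.

```python
from itertools import product

# Hypothetical toy input: rows are clusters, columns are tasks. Each entry is
# the number of data points the cluster contributes to that task.
counts = [
    [4, 0],
    [3, 2],
    [2, 1],
    [1, 3],
    [0, 2],
    [0, 2],
]
# Assumed target fractions for the training, validation and test sets.
fractions = [0.6, 0.2, 0.2]


def imbalance(assignment, counts, fractions):
    """Sum over sets and tasks of |actual count - target count|."""
    n_tasks = len(counts[0])
    totals = [sum(row[t] for row in counts) for t in range(n_tasks)]
    actual = [[0] * n_tasks for _ in fractions]
    # Whole clusters go to a single set, so chemical separation is preserved.
    for cluster, set_index in enumerate(assignment):
        for t in range(n_tasks):
            actual[set_index][t] += counts[cluster][t]
    return sum(
        abs(actual[s][t] - fractions[s] * totals[t])
        for s in range(len(fractions))
        for t in range(n_tasks)
    )


def best_assignment(counts, fractions):
    """Try every cluster-to-set assignment and keep the most balanced one."""
    candidates = product(range(len(fractions)), repeat=len(counts))
    return min(candidates, key=lambda a: imbalance(a, counts, fractions))


assignment = best_assignment(counts, fractions)
score = imbalance(assignment, counts, fractions)
print(assignment, score)
```

Enumeration is only feasible for a handful of clusters, since the number of candidate assignments grows as 3 to the power of the cluster count. A linear-programming formulation expresses the same one-set-per-cluster constraint with binary variables and hands the search to a solver.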

Supplementary materials

README.md: brief user manual for the data balancing Python script

balance_data_from_tasks_vs_clusters_array_pulp.py: data balancing Python script, subject to the license in the 'COPYING' text file

COPYING: license file for the data balancing Python script 'balance_data_from_tasks_vs_clusters_array_pulp.py'

Datasets_and_results.zip: dataset-specific files referred to in the Supporting material section of the paper
