Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets

29 March 2024, Version 3
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

When preparing training, validation and test sets for machine learning on molecular datasets, it is desirable to satisfy two requirements: 1) robustness, i.e. making a test set that is chemically dissimilar from the training set; 2) data balance, i.e. ensuring that, for each individual property to model, the proportion of data points and the distribution of data labels (categorical) or data values (continuous) are as homogeneous as possible among the sets, while partitioning the overall set of compounds as required. Recent literature shows that meeting both of these requirements simultaneously can be very difficult. This is especially true for multi-task learning, but it also applies to single-task learning when one aims to balance the distribution of data labels or values as well. In this work we present a method that resolves this issue by first carrying out a chemistry-guided clustering of the initial dataset to ensure the separation of chemical matter, and subsequently applying linear programming to select the lists of clusters that, once assembled into the final sets, result in the best possible data balance.
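The cluster-selection step described above can be illustrated with a toy example. The sketch below is not the authors' script: it uses made-up cluster counts, assumes that imbalance is measured as the summed absolute deviation from the requested set fractions, and replaces the linear programming solver with an exhaustive search, which is only feasible for a handful of clusters. All names (`counts`, `target_fractions`, `imbalance`, `best_split`) are hypothetical. It balances only the proportion of data points per task; label balance could plausibly be handled by giving each label of each task its own column, but that is an assumption.

```python
from itertools import product

# Hypothetical toy data: one row per cluster, one column per task.
# Each entry is the number of data points the cluster contributes to the task.
counts = [
    [6, 0],
    [0, 6],
    [3, 3],
    [3, 3],
    [9, 9],
    [9, 9],
]
set_names = ("training", "validation", "test")
target_fractions = (0.6, 0.2, 0.2)  # assumed split, not taken from the paper


def imbalance(assignment, counts, target_fractions):
    """Summed absolute deviation from the target size, over sets and tasks."""
    n_tasks = len(counts[0])
    totals = [sum(row[t] for row in counts) for t in range(n_tasks)]
    cost = 0.0
    for s, fraction in enumerate(target_fractions):
        for t in range(n_tasks):
            assigned = sum(
                row[t] for row, a in zip(counts, assignment) if a == s
            )
            cost += abs(assigned - fraction * totals[t])
    return cost


def best_split(counts, target_fractions):
    """Try every assignment of whole clusters to sets and keep the best one.

    Whole clusters are never divided, so chemically similar compounds
    stay in the same set. A solver would replace this brute-force loop.
    """
    n_sets = len(target_fractions)
    candidates = product(range(n_sets), repeat=len(counts))
    return min(candidates, key=lambda a: imbalance(a, counts, target_fractions))


best = best_split(counts, target_fractions)
best_cost = imbalance(best, counts, target_fractions)
for s, name in enumerate(set_names):
    members = [i for i, a in enumerate(best) if a == s]
    print(name, "clusters:", members)
print("imbalance:", best_cost)
```

In a linear programming formulation, each cluster-to-set assignment would become a binary variable, and the absolute deviations would be linearized with auxiliary variables, which lets a solver handle far more clusters than enumeration can.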

Keywords

machine learning
QSAR
robustness
data balance
label balance
class balance
multi-task

Supplementary materials

README.md: brief user manual for the data balancing Python script
balance_data_from_tasks_vs_clusters_array_pulp.py: data balancing Python script, subject to the license in the 'COPYING' text file
COPYING: license file for the data balancing Python script 'balance_data_from_tasks_vs_clusters_array_pulp.py'
Datasets_and_results.zip: the dataset-specific files referred to in the Supporting material section of the paper
