Abstract
When preparing training, validation and test sets for machine learning on molecular datasets, it is desirable to meet two requirements: 1) robustness, i.e. a test set that is chemically dissimilar from the training set; 2) data balance, i.e. for each individual property to model, the proportion of data points and the distribution of data labels (categorical) or data values (continuous) should be as homogeneous as possible among the sets, while the overall set of compounds is partitioned as required. Recent literature shows that meeting both requirements simultaneously can be very difficult. This is especially true for multi-task learning, but it also holds for single-task learning when one additionally aims to balance the distribution of data labels or values. In this work, we present a method that resolves this issue by first carrying out a chemistry-guided clustering of the initial dataset to ensure the separation of chemical matter, and subsequently applying linear programming to select the lists of clusters that, once assembled into the final sets, result in the best possible data balance.
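To make the cluster-selection step concrete, the following is a minimal sketch of the underlying optimisation on a toy example. It is not the authors' script: the supplementary script name suggests that the actual method uses the PuLP linear programming library, whereas this sketch swaps in a plain exhaustive search over cluster-to-set assignments so that it runs with the Python standard library alone. The toy counts, the target fractions and the absolute-deviation objective are all assumptions made for illustration, and only the proportion of data points per task is balanced here, not the distribution of labels or values.

```python
from itertools import product

# Hypothetical toy data: counts[c][t] = number of data points of task t in cluster c.
# The clusters are assumed to come from a prior chemistry-guided clustering.
counts = [
    [6, 0],  # cluster 0: only task 0 measured
    [0, 6],  # cluster 1: only task 1 measured
    [2, 2],  # cluster 2: both tasks measured
    [2, 2],  # cluster 3: both tasks measured
]
sets = ["train", "valid", "test"]
target = [0.6, 0.2, 0.2]  # assumed desired fraction of each task's data per set

n_tasks = len(counts[0])
totals = [sum(row[t] for row in counts) for t in range(n_tasks)]


def imbalance(assignment):
    """Sum over sets and tasks of |assigned count - target count|."""
    cost = 0.0
    for s, frac in enumerate(target):
        for t in range(n_tasks):
            got = sum(row[t] for row, a in zip(counts, assignment) if a == s)
            cost += abs(got - frac * totals[t])
    return cost


# Each cluster is assigned to exactly one set, so compounds from the same
# cluster never end up on both sides of a split.
# Brute force stands in for the linear programming solver (toy sizes only).
best = min(product(range(len(sets)), repeat=len(counts)), key=imbalance)
best_cost = imbalance(best)
split = {name: [c for c, a in enumerate(best) if a == s]
         for s, name in enumerate(sets)}
print(split, best_cost)
```

Exhaustive search grows as the number of sets raised to the number of clusters, so it is only usable for a handful of clusters; for realistic datasets the same objective and one-set-per-cluster constraint would be handed to a linear programming solver instead.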
Supplementary materials
Title
README.md
Description
Brief user manual for the data balancing Python script
Title
balance_data_from_tasks_vs_clusters_array_pulp.py
Description
Data balancing Python script, subject to the license in the 'COPYING' text file
Title
COPYING
Description
License file for the data balancing Python script 'balance_data_from_tasks_vs_clusters_array_pulp.py'
Title
Datasets_and_results.zip
Description
Dataset-specific files referred to in the Supporting material section of the paper