Abstract
Protein-ligand interaction prediction with proteochemometric (PCM) models can provide valuable insights during early drug discovery and chemical safety assessment. These models have benefitted from the large amount of data available in bioactivity databases. However, an issue that is often overlooked when using these data is the broad diversity of the biological assays they originate from. The effect of a small molecule on a protein can be measured in various ways, and the choice of assay can influence the outcome. Yet standardized, specific assay metadata are currently lacking, even though such metadata could help increase understanding of the origin of datapoints, improve data curation, and lead to better models.
To make use of the existing information on the biological context, we set out to create and validate multiple assay descriptors and test their use in protein-ligand interaction models. Dimensionality reduction of embedded free-text assay descriptions from ChEMBL showed that BioBERT embeddings capture relevant features. Additionally, clustering these embedded descriptions groups the assays in a way that enriches purity, matches manually categorized assays, and yields sensible topic-describing words. For ligand-protein combinations with multiple measurements, the deviation between measurements in general is higher than the deviation between measurements within the same assay category, with logarithmic mean absolute errors of 0.83 and 0.66, respectively. Incorporating this biological context into the PCM models in the form of BioBERT-based embeddings improved the average R² from 0.67 to 0.69 across different datasets and splits. In contrast, a simpler method such as bag-of-words, in which frequently used words serve as features, gave no improvement (average R² 0.66). In addition, this novel method for assay categorization facilitates data curation and provides a useful overview of the biological context of studied targets. In conclusion, biological assay context is important for bioactivity modeling, and the presented assay categorization provides a means to easily gain insight into this context.
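
The comparison of measurement deviations can be illustrated with a minimal sketch. It assumes that the deviation is computed as the mean absolute pairwise difference between log-scale activity values for the same ligand-protein combination, which is one plausible reading of the logarithmic mean absolute error and not necessarily the exact procedure used. The function name `mean_abs_pairwise_diff`, the record layout, the category labels, and the toy values are hypothetical and serve only as illustration.

```python
from collections import defaultdict
from itertools import combinations


def mean_abs_pairwise_diff(records, same_category_only=False):
    """Mean absolute difference between repeated log-scale measurements.

    records: iterable of (ligand, protein, assay_category, log_activity).
    If same_category_only is True, only pairs of measurements that share
    an assay category are compared.
    """
    # Group all measurements by ligand-protein combination.
    groups = defaultdict(list)
    for ligand, protein, category, value in records:
        groups[(ligand, protein)].append((category, value))

    diffs = []
    for measurements in groups.values():
        # Compare every pair of measurements for the same combination.
        for (cat_a, val_a), (cat_b, val_b) in combinations(measurements, 2):
            if same_category_only and cat_a != cat_b:
                continue
            diffs.append(abs(val_a - val_b))
    return sum(diffs) / len(diffs) if diffs else float("nan")


# Hypothetical toy data, not taken from ChEMBL.
records = [
    ("lig1", "protA", "binding", 6.0),
    ("lig1", "protA", "binding", 6.5),
    ("lig1", "protA", "functional", 8.0),
    ("lig2", "protA", "binding", 5.0),
    ("lig2", "protA", "functional", 7.0),
]

overall = mean_abs_pairwise_diff(records)
within = mean_abs_pairwise_diff(records, same_category_only=True)
print(f"all pairs: {overall:.2f}, within category: {within:.2f}")
```

In this toy example the within-category deviation is lower than the overall deviation, mirroring the direction of the reported effect; the numbers themselves carry no meaning beyond the illustration.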