Data Exploration for Target Predictions Using Proprietary and Publicly Available Data Sets

  • Aljoša Smajić
  • Thomas Steger-Hartmann
  • Gerhard F Ecker
  • Anke Hackl (Corresponding author)

Publications: Contribution to journal, Article, Peer Reviewed

Abstract

When applying machine learning (ML) approaches for the prediction of bioactivity, it is common to collect data from different assays or sources and combine them into single data sets. However, depending on the data domains and sources from which these data are retrieved, bioactivity data for the same macromolecular target may show a high variance of values (looking at a single compound) and cover very different parts of the chemical space as well as the bioactivity range (looking at the whole data set). The effectiveness and applicability domain of the resulting prediction models may be strongly influenced by the sources from which their training data were retrieved. Therefore, we investigated the chemical space and active/inactive distribution of proprietary pharmaceutical data from Bayer AG and the publicly available ChEMBL database, and their impact when applied as training data for classification models. To this end, we applied two different sets of descriptors in combination with different ML algorithms. The results show substantial differences in chemical space between the two data sources, leading to suboptimal prediction performance when models are applied to domains other than their training data. Across all targets, Matthews correlation coefficient (MCC) values ranged from -0.34 to 0.37 when models trained on Bayer AG data were tested on ChEMBL data and vice versa, indicating suboptimal model performance. For 31 targets, the mean Tanimoto similarity of the nearest neighbors between the two data sources was 0.3 or lower. Interestingly, none of the methods applied to assess the overlap in chemical space between the two data sources, and thereby to predict the applicability of models beyond their training data sets, correlated with the observed performance. Finally, we applied different strategies for creating mixed training data sets based on both public and proprietary sources, using assay format (cell-based and cell-free) information and Tanimoto similarities.
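
The two quantities reported in the abstract, the mean nearest-neighbor Tanimoto similarity between two data sources and the MCC of a classifier, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which fingerprints or software were used, so fingerprints are assumed here to be plain sets of on-bit indices, and the function names (`tanimoto`, `mean_nearest_neighbor_similarity`, `mcc`) and the toy data are hypothetical.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_nearest_neighbor_similarity(queries, references):
    """For each query, take its highest similarity to any reference, then average."""
    best = [max(tanimoto(q, r) for r in references) for q in queries]
    return sum(best) / len(best)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy fingerprints standing in for compounds from two sources (made-up data).
source_a = [{1, 2, 3}, {7, 8}]
source_b = [{2, 3, 4}, {8, 9}]
similarity = mean_nearest_neighbor_similarity(source_a, source_b)

# Toy labels: true activity classes versus predictions of a hypothetical model.
score = mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

A low mean nearest-neighbor similarity means that, on average, even the closest compound in the other source shares few fingerprint bits with the query. MCC ranges from -1 to 1, with values near 0 corresponding to chance-level classification.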

Original language: English
Pages (from-to): 820-833
Number of pages: 14
Journal: Chemical Research in Toxicology
Volume: 38
Issue number: 5
Publication status: Published - 20 Apr 2025

Austrian Fields of Science 2012

  • 301207 Pharmaceutical chemistry

Keywords

  • Machine Learning
  • Algorithms
  • Databases, Chemical
  • Pharmaceutical Preparations/chemistry
  • Databases, Factual
