
Data Exploration for Target Predictions Using Proprietary and Publicly Available Data Sets

Publications: Contribution to journal, Article, Peer reviewed

Abstract

When applying machine learning (ML) approaches for the prediction of bioactivity, it is common to collect data from different assays or sources and combine them into single data sets. However, depending on the data domains and sources from which these data are retrieved, bioactivity data for the same macromolecular target may show a high variance of values (looking at a single compound) and cover very different parts of the chemical space as well as the bioactivity range (looking at the whole data set). The effectiveness and applicability domain of the resulting prediction models may be strongly influenced by the sources from which their training data were retrieved. Therefore, we investigated the chemical space and active/inactive distribution of proprietary pharmaceutical data from Bayer AG and the publicly available ChEMBL database, and their impact when applied as training data for classification models. To this end, we applied two different sets of descriptors in combination with different ML algorithms. The results show substantial differences in chemical space between the two data sources, leading to suboptimal prediction performance when models are applied to domains other than their training data. MCC values between -0.34 and 0.37 were obtained across all targets when models trained on Bayer AG data were tested on ChEMBL data and vice versa. For 31 targets, the mean Tanimoto similarity of the nearest neighbors between the two data sources was equal to or less than 0.3. Interestingly, none of the methods applied to assess the overlap in chemical space between the two data sources, with the aim of predicting the applicability of models beyond their training data sets, correlated with the observed performance. Finally, we applied different strategies for creating mixed training data sets based on both public and proprietary sources, using assay format (cell-based and cell-free) information and Tanimoto similarities.
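The two quantities reported in the abstract, the Matthews correlation coefficient (MCC) and the mean Tanimoto similarity of nearest neighbors between two data sets, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, fingerprints are assumed to be plain sets of on-bit indices rather than the descriptors used in the paper, and the toy data are invented purely for illustration.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints (assumption)
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_nearest_neighbor_similarity(queries, references):
    """Average, over all query fingerprints, of the highest similarity
    to any fingerprint in the reference set."""
    best = [max(tanimoto(q, r) for r in references) for q in queries]
    return sum(best) / len(best)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy fingerprints standing in for two data sources (invented values).
source_a = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
source_b = [frozenset({1, 2, 3}), frozenset({7, 8, 9})]

print(mean_nearest_neighbor_similarity(source_a, source_b))  # 0.625
print(round(mcc([1, 1, 0, 0], [1, 0, 0, 0]), 3))             # 0.577
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which is why the reported range of -0.34 to 0.37 points to weak transfer between the two sources. In practice, RDKit's `DataStructs.TanimotoSimilarity` and scikit-learn's `matthews_corrcoef` provide the same calculations on real fingerprints and predictions.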

Original language: English
Pages (from - to): 820-833
Number of pages: 14
Journal: Chemical Research in Toxicology
Volume: 38
Issue number: 5
DOIs
Publication status: Published - 20 Apr. 2025

Funding

We would like to acknowledge the resources provided by Bayer Pharma AG for this project. The authors thank Marina Garcia de Lomana for her support in the project. The Pharmacoinformatics Research Group (Ecker lab) acknowledges funding provided by the Austrian Science Fund FWF W1232 MolTag. Open Access is funded by the Austrian Science Fund (FWF).

ÖFOS 2012

  • 301207 Pharmaceutical chemistry
  • 102019 Machine Learning
