TY - JOUR
T1 - Machine translation vs. multilingual dictionaries: Assessing two strategies for the topic modeling of multilingual text collections.
AU - Waldherr, Annie
AU - Maier, Daniel
AU - Baden, Christian
AU - Stoltenberg, Daniela
AU - De Vries-Kedem, Maya
N1 - Publisher Copyright: © 2021 The Author(s). Published with license by Taylor & Francis Group, LLC.
PY - 2022
Y1 - 2022
N2 - The goal of this paper is to evaluate two methods for the topic modeling of multilingual document collections: (1) machine translation (MT), and (2) the coding of semantic concepts using a multilingual dictionary (MD) prior to topic modeling. We empirically assess the consequences of these approaches based on both a quantitative comparison of models and a qualitative validation of each method’s potentials and weaknesses. Our case study uses two text collections (of tweets and news articles) in three languages (English, Hebrew, Arabic), covering the ongoing local conflicts between Israeli authorities, settlers, and Palestinian Bedouins in the West Bank. We find that both methods produce a large share of equivalent topics, especially in the context of fairly homogenous news discourse, yet show limited but systematic differences when applied to highly heterogenous social media discourse. While the MD model delivers a more nuanced picture of conflict-related topics, it misses several more peripheral topics, especially those unrelated to the dictionary’s focus, which are picked up by the MT model. Our study is a first step toward instrument validation, indicating that both methods yield valid, comparable results, while method-specific differences remain.
UR - http://www.scopus.com/inward/record.url?scp=85112703806&partnerID=8YFLogxK
U2 - https://doi.org/10.1080/19312458.2021.1955845
DO - https://doi.org/10.1080/19312458.2021.1955845
M3 - Article
VL - 16
SP - 19
EP - 38
JO - Communication Methods & Measures
JF - Communication Methods & Measures
SN - 1931-2458
IS - 1
ER -