Description
Linguistic diversity is a major element of human cultures. While language variation is increasingly considered in language technology development, dialects are still underrepresented. This is despite the fact that dialect loss (Schilling-Estes & Wolfram, 1999) is as prevalent as language loss all over the world. Measures to counter this loss, preserve dialects for future generations or even promote their use in the present include crowdsourcing initiatives that create language data for the further development of language technologies.
Illustrated by a citizen science project in the field of lexicography (Heinisch, 2020), this presentation demonstrates how language data can be generated together with dialect speakers and further processed to make them available for use by language technologies. The presentation assesses the underlying ethical, technological and societal implications of crowdsourced language data generation for further use in language technologies.
Among the ethical implications is the instrumentalization of dialect speakers, which might be outweighed by the benefits they gain. Furthermore, there are challenges resulting from the crowdsourced preservation of cultural heritage. Moreover, the reversed roles of researchers and laypersons regarding expertise in the case of dialects emphasise the significance of lived experience in dialect use. The intended further usage of language data, such as for language technology development, also gives rise to ethical considerations.
Among the technological implications are usability considerations. Since the persons creating language data in citizen science projects are usually not experts in linguistics, it is necessary to strike a balance between simplifying the user interface and guaranteeing high data quality. Additionally, further use of the created language data by humans and machines needs to be taken into consideration. This may include making the data available in a visually appealing way for the contributors themselves and making them FAIR (findable, accessible, interoperable, reusable), such as through the Linguistic Linked Open Data Cloud (Cimiano, Chiarcos, McCrae, & Gracia, 2020), thus opening them up for a wide range of potential (language technology) applications.
Societal implications of the crowdsourced creation of language data include the public perception of the endeavour itself, the effects on language technology development in everyday life and the impact on policy. The public perception of dialects, including ideologies and attitudes about language, has an effect on dialect vitality (Schneider, 2018). Therefore, participation in and media coverage of dialect data initiatives may also influence the preparation of language policies and thus stimulate the development of language technologies focussing on dialects.
Period | 15 May 2023 |
---|---|
Event title | 3rd International Conference ‘Language in the Human-Machine Era’ (LITHME) |
Event type | Conference |
Location | Groningen, Netherlands |
Degree of Recognition | International |