Not always about you: Prioritizing community needs when developing endangered language technology

Resource type
Conference Paper
Authors/contributors
Title
Not always about you: Prioritizing community needs when developing endangered language technology
Abstract
Languages are classified as low-resource when they lack the quantity of data necessary for training statistical and machine learning tools and models. Causes of resource scarcity vary but can include poor access to technology for developing these resources, a relatively small population of speakers, or a lack of urgency for collecting such resources in bilingual populations where the second language is high-resource. As a result, the languages described as low-resource in the literature are as different as Finnish on the one hand, with millions of speakers using it in every imaginable domain, and Seneca, with only a small handful of fluent speakers using the language primarily in a restricted domain. While issues stemming from the lack of resources necessary to train models unite this disparate group of languages, many other issues cut across the divide between widely-spoken low-resource languages and endangered languages. In this position paper, we discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face when working together to develop language technology to support endangered language documentation and revitalization. We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics. We describe an ongoing fruitful collaboration and make recommendations for future partnerships between academic researchers and language community stakeholders.
Date
2022-05
Proceedings Title
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Conference Name
ACL 2022
Place
Dublin, Ireland
Publisher
Association for Computational Linguistics
Pages
3933–3944
Short Title
Not always about you
Accessed
25/11/2024 13:23
Library Catalog
ACLWeb
Notes

Key points

The paper classifies low-resource languages into 3 categories:

  • Elephant = many speakers, well supported (e.g. Danish)
  • Ocelot = fewer speakers, well supported (e.g. Yiddish)
  • Coyote = few speakers, little support (e.g. Kodi)

    • often lack a standardized written form = mostly oral languages with dialectal/idiolectal variation
    • sometimes face(d) linguicide

Major goal of revitalization = facilitating usage in wider contexts

Tools that help most:

  • those for classroom education
  • those for developing new usage situations

Shift from “linguist-focused model” to “Community-Based Language Research” -> the community has the central role in the research, supported by outside linguists

Survey of speakers of different languages, in different roles (teachers, L1/L2 speakers), about various technologies and general written documentation

“For many of these languages, the priorities of the speech communities are how to more effectively document, teach, and reclaim their language; how to save the cultural heritage passed down from the elders; and how to let their language have a voice among other widely-spoken or dominant languages” (Liu et al., 2022, p. 3938)

Recommendations:

  • Build bonds with indigenous communities responsibly and sensitively
  • Recognize community members who made meaningful contributions (as co-authors or otherwise, as appropriate)
  • Describe data collection protocols and challenges
  • Make explicit plans for sharing, archiving, and storing the data (including with the speech community)
  • Consult with speech communities when building language technologies
  • Find ways to incorporate technology output into the work of the speech community

Reference
Liu, Z., Richardson, C., Hatcher, R., & Prud’hommeaux, E. (2022). Not always about you: Prioritizing community needs when developing endangered language technology. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3933–3944). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.272
Paper type