Keep it Local: Comparing Domain-Specific LLMs in Native and Machine Translated Text using Parallel Corpora on Political Conflict

Javier Osorio, Sultan Alsarra, Amber Converse, Afraa Alshammari, Dagmar Heintze, Latifur Khan, Naif Alatrush, Patrick T. Brandt, Vito D'orazio, Niamat Zawad, Mahrusa Billah

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The dynamics of political conflict and cooperation require powerful computerized tools capable of effectively tracking security threats and cooperation around the world. This study compares the performance of domain-specific Large Language Models (LLMs) against generically trained LLMs in binary and multi-class classification using native text in English, Spanish, and Arabic, and their corresponding machine translations. This endeavor yields four key contributions. 1) We present and make available a novel database of annotations using a multi-lingual parallel corpus from the United Nations. 2) Using various metrics, we assess the quality of different machine translation tools. 3) Our results indicate that the ConfliBERT family of LLMs, a set of domain-specific models tailored for political conflict, outperforms generically trained LLMs in English, Spanish, and Arabic in both binary and multi-class tasks. 4) We also disentangle the heterogeneous effects of machine translation on LLM performance in different languages. Overall, results reveal the comparative advantage of domain-specific LLMs specialized in political conflict and applied to native-language text for understanding the dynamics of violence and cooperation worldwide. Our multi-lingual ConfliBERT LLMs provide critical cyber-infrastructure enabling scholars and government agencies to use their local languages and information to foster safer, more stable political environments.

Original language: English (US)
Title of host publication: 2024 2nd International Conference on Foundation and Large Language Models, FLLM 2024
Editors: Yaser Jararweh, Jim Jansen, Mohammad Alsmirat
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 542-552
Number of pages: 11
ISBN (Electronic): 9798350354799
State: Published - 2024
Event: 2nd International Conference on Foundation and Large Language Models, FLLM 2024 - Dubai, United Arab Emirates
Duration: Nov 26 2024 to Nov 29 2024

Publication series

Name: 2024 2nd International Conference on Foundation and Large Language Models, FLLM 2024

Conference

Conference: 2nd International Conference on Foundation and Large Language Models, FLLM 2024
Country/Territory: United Arab Emirates
City: Dubai
Period: 11/26/24 to 11/29/24

Keywords

  • machine translation
  • Multilingual LLMs
  • political conflict
  • United Nations

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Software
  • Linguistics and Language
