Informal Persian Universal Dependency Treebank

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian, showing that these two variants have fundamental differences that cannot be attributed solely to pronunciation discrepancies. Given that informal Persian exhibits particular characteristics, any computational model trained on formal Persian is unlikely to transfer well to informal Persian, necessitating the creation of dedicated treebanks for this variety. We thus detail the development of the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme. We then investigate the parsing of informal Persian by training two dependency parsers on existing formal treebanks and evaluating them on out-of-domain data, i.e. the development set of our informal treebank. Our results show that parsers experience a substantial performance drop when we move across the two domains, as they face more unknown tokens and structures and fail to generalize well. Furthermore, the dependency relations whose performance deteriorates the most represent the unique properties of the informal variant. The ultimate goal of this study that demonstrates a broader impact is to provide a stepping-stone to reveal the significance of informal variants of languages, which have been widely overlooked in natural language processing tools across languages.

Original languageEnglish (US)
Title of host publication2022 Language Resources and Evaluation Conference, LREC 2022
EditorsNicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
PublisherEuropean Language Resources Association (ELRA)
Pages7096-7105
Number of pages10
ISBN (Electronic)9791095546726
StatePublished - 2022
Event13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: Jun 20 2022Jun 25 2022

Publication series

Name2022 Language Resources and Evaluation Conference, LREC 2022

Conference

Conference13th International Conference on Language Resources and Evaluation Conference, LREC 2022
Country/TerritoryFrance
CityMarseille
Period6/20/226/25/22

Keywords

  • annotation
  • colloquial Persian
  • dependency parsing

ASJC Scopus subject areas

  • Language and Linguistics
  • Library and Information Sciences
  • Linguistics and Language
  • Education

Fingerprint

Dive into the research topics of 'Informal Persian Universal Dependency Treebank'. Together they form a unique fingerprint.

Cite this