Developing language-tagged corpora for code-switching tweets

Suraj Maharjan, Elizabeth Blair, Steven Bethard, Thamar Solorio

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Code-switching, where a speaker switches between languages mid-utterance, is frequently used by multilingual populations worldwide. Despite its prevalence, limited effort has been devoted to developing computational approaches or even basic linguistic resources to support research into the processing of such mixed-language data. We present a user-centric approach to collecting code-switched utterances from social media posts, and develop language-universal guidelines for the annotation of code-switched data. We also present results for several baseline language identification models on our corpora and demonstrate that language identification in code-switched text is a difficult task that calls for deeper investigation.

Original language: English (US)
Title of host publication: LAW 2015 - 9th Linguistic Annotation Workshop, held in conjunction with NAACL 2015 - Proceedings of the Workshop
Editors: Adam Meyers, Ines Rehbein, Heike Zinsmeister
Publisher: Association for Computational Linguistics (ACL)
Pages: 72-84
Number of pages: 13
ISBN (Electronic): 9781941643471
State: Published - 2020
Externally published: Yes
Event: 9th Linguistic Annotation Workshop, LAW 2015, held in conjunction with NAACL 2015 - Denver, United States
Duration: Jun 5 2015 → …

Publication series

Name: LAW 2015 - 9th Linguistic Annotation Workshop, held in conjunction with NAACL 2015 - Proceedings of the Workshop

Conference

Conference: 9th Linguistic Annotation Workshop, LAW 2015, held in conjunction with NAACL 2015
Country/Territory: United States
City: Denver
Period: 6/5/15 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
