Fully unsupervised word segmentation with BVE and MDL

Daniel Hewlett, Paul Cohen

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    14 Scopus citations

    Abstract

    Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
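The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a simple two-part code (bits to spell out the lexicon, plus bits to encode the corpus as a sequence of lexicon entries), which is one common way to compute description length and not necessarily the exact formulation used in the paper. The function names `entropy_bits`, `description_length`, and `select_by_mdl`, and the `#` end-of-word marker, are hypothetical. Candidate generation (for example by Bootstrapped Voting Experts) is not shown; the candidates here are written by hand.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Bits needed to encode every counted item under its empirical distribution."""
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


def description_length(segmentation):
    """Two-part code length, in bits, of a corpus given as a list of word tokens.

    Assumption: lexicon cost is the cost of spelling each distinct word once
    (with an end-of-word marker), and corpus cost is the cost of the token
    sequence under the empirical word distribution.
    """
    word_counts = Counter(segmentation)

    # Lexicon cost: character counts over the distinct words only.
    char_counts = Counter()
    for word in word_counts:
        char_counts.update(word)
        char_counts["#"] += 1  # hypothetical end-of-word marker

    lexicon_bits = entropy_bits(char_counts)
    corpus_bits = entropy_bits(word_counts)
    return lexicon_bits + corpus_bits


def select_by_mdl(candidates):
    """Return the candidate segmentation with the smallest description length."""
    return min(candidates, key=description_length)


# Three hand-written candidate segmentations of the same unsegmented text.
text = "thedogthedogthecat"
candidates = [
    list(text),                                   # every character is a word
    [text],                                       # the whole text is one word
    ["the", "dog", "the", "dog", "the", "cat"],   # word-level segmentation
]
best = select_by_mdl(candidates)
print(best)
```

On this toy input the word-level candidate receives the shortest code: splitting into single characters inflates the corpus cost, while leaving the text unsegmented inflates the lexicon cost. A real experiment would compare candidates produced by different segmentation algorithms or parameter settings over a full corpus.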

    Original language: English (US)
    Title of host publication: ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics
    Subtitle of host publication: Human Language Technologies
    Pages: 540-545
    Number of pages: 6
    State: Published - 2011
    Event: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011 - Portland, OR, United States
    Duration: Jun 19 2011 – Jun 24 2011

    Publication series

    Name: ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
    Volume: 2

    Other

    Other: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011
    Country/Territory: United States
    City: Portland, OR
    Period: 6/19/11 – 6/24/11

    ASJC Scopus subject areas

    • Language and Linguistics
    • Linguistics and Language
