TY - GEN
T1 - Fully unsupervised word segmentation with BVE and MDL
AU - Hewlett, Daniel
AU - Cohen, Paul
PY - 2011
Y1 - 2011
N2 - Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
UR - http://www.scopus.com/inward/record.url?scp=84859036614&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84859036614&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84859036614
SN - 9781932432886
T3 - ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
SP - 540
EP - 545
BT - ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
T2 - 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011
Y2 - 19 June 2011 through 24 June 2011
ER -