Mechanisms for automatic training data labeling for machine learning

Yang Gu, Gondy Leroy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

One of the most pervasive challenges in adopting machine or deep learning is the scarcity of training data. This problem is amplified in IS research, where application domains usually require specialized knowledge. This study compares three systems for creating a large training dataset when only a small amount of human-labeled data is available: a high-precision LSTM classifier, a high-recall LSTM classifier, and a manually created rule-based system. Starting from fewer than 20,000 human-labeled training examples, we used automated labeling to add 100,000 examples to the training data. We found that combining a small human-labeled dataset with a system-labeled dataset improves classification performance. In our evaluation, adding training data labeled by the high-recall LSTM to the human-labeled dataset achieved an F1 of 0.578, and adding training data labeled by the rule-based system achieved an F1 of 0.598, an improvement of over 4% compared to a baseline system that uses only human-labeled data.
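
The general idea of combining human-labeled and system-labeled data can be sketched as follows. This is an illustration only and not the authors' implementation: the keyword rules, the function names (`rule_labeler`, `augment`, `f1_score`), and the example texts are all hypothetical, and a toy rule set stands in for the paper's rule-based system and LSTM classifiers. The sketch assumes binary labels and a labeler that may abstain.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, int]  # (text, label), with label in {0, 1}


def rule_labeler(text: str) -> Optional[int]:
    """Hypothetical hand-written rules; returns None when no rule fires."""
    lowered = text.lower()
    if "refund" in lowered or "broken" in lowered:
        return 1
    if "thanks" in lowered or "great" in lowered:
        return 0
    return None  # abstain rather than guess


def augment(
    human: List[Example],
    unlabeled: List[str],
    labeler: Callable[[str], Optional[int]],
) -> List[Example]:
    """Return the human-labeled data plus every text the labeler accepts."""
    system_labeled: List[Example] = []
    for text in unlabeled:
        label = labeler(text)
        if label is not None:
            system_labeled.append((text, label))
    return human + system_labeled


def f1_score(gold: List[int], pred: List[int]) -> float:
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In such a setup, a classifier would be trained once on the human-labeled data alone and once on the output of `augment`, and the two would be compared with `f1_score` on the same held-out human-labeled test set.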

Original language: English (US)
Title of host publication: 40th International Conference on Information Systems, ICIS 2019
Publisher: Association for Information Systems
ISBN (Electronic): 9780996683197
State: Published - 2019
Event: 40th International Conference on Information Systems, ICIS 2019 - Munich, Germany
Duration: Dec 15 2019 to Dec 18 2019

Publication series

Name: 40th International Conference on Information Systems, ICIS 2019

Conference

Conference: 40th International Conference on Information Systems, ICIS 2019
Country/Territory: Germany
City: Munich
Period: 12/15/19 to 12/18/19

Keywords

  • Deep learning
  • Machine learning
  • Natural language processing
  • Self-labelling
  • Training data discovery

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
