TY - GEN
T1 - Mechanisms for automatic training data labeling for machine learning
AU - Gu, Yang
AU - Leroy, Gondy
N1 - Publisher Copyright:
© 40th International Conference on Information Systems, ICIS 2019. All rights reserved.
PY - 2019
Y1 - 2019
AB - One of the most pervasive challenges in adopting machine or deep learning is the scarcity of training data. This problem is amplified in IS research, where application domains usually require specialized knowledge. This study compares three systems for creating a large training dataset when only a small amount of human-labeled data is available: a high-precision LSTM classifier, a high-recall LSTM classifier, and a manually created rule-based system. Starting from fewer than 20,000 human-labeled training examples, we used automated labeling to add 100,000 examples to the training data. We found that combining a small human-labeled dataset with a system-labeled dataset improves classification performance. In our evaluation, adding training data labeled by the high-recall LSTM to the human-labeled dataset achieved an F1 of 0.578, and adding training data labeled by the rule-based system achieved an F1 of 0.598, an improvement of over 4% compared to a baseline system that uses only human-labeled data.
KW - Deep learning
KW - Machine learning
KW - Natural language processing
KW - Self-labelling
KW - Training data discovery
UR - http://www.scopus.com/inward/record.url?scp=85114902214&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85114902214&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85114902214
T3 - 40th International Conference on Information Systems, ICIS 2019
BT - 40th International Conference on Information Systems, ICIS 2019
PB - Association for Information Systems
T2 - 40th International Conference on Information Systems, ICIS 2019
Y2 - 15 December 2019 through 18 December 2019
ER -