TY - GEN
T1 - Paralinguistic classification of mask wearing by image classifiers and fusion
AU - Szep, Jeno
AU - Hariri, Salim
N1 - Publisher Copyright:
© 2020 ISCA
PY - 2020
Y1 - 2020
N2 - In this study, we address the ComParE 2020 Paralinguistics Mask sub-challenge, in which the task is to detect, from short speech segments, whether the speaker is wearing a surgical mask. We propose a computer-vision-based pipeline that leverages the deep convolutional neural network image classifiers developed in recent years and applies them to a specific class of spectrograms. Several linear- and logarithmic-scale spectrograms were tested, and the best performance was achieved on linear-scale, 3-channel spectrograms created from the audio segments. A single-model image classifier produced a 6.1% better result than the best single-dataset baseline model. An ensemble of our models further improves accuracy, achieving 73.0% UAR when trained on the 'train' dataset alone and 80.1% UAR on the test set when training also includes the 'devel' dataset, a result 8.3% higher than the baseline. We also provide an activation-mapping analysis to identify the frequency ranges that are critical in the 'mask' versus 'clear' classification.
AB - In this study, we address the ComParE 2020 Paralinguistics Mask sub-challenge, in which the task is to detect, from short speech segments, whether the speaker is wearing a surgical mask. We propose a computer-vision-based pipeline that leverages the deep convolutional neural network image classifiers developed in recent years and applies them to a specific class of spectrograms. Several linear- and logarithmic-scale spectrograms were tested, and the best performance was achieved on linear-scale, 3-channel spectrograms created from the audio segments. A single-model image classifier produced a 6.1% better result than the best single-dataset baseline model. An ensemble of our models further improves accuracy, achieving 73.0% UAR when trained on the 'train' dataset alone and 80.1% UAR on the test set when training also includes the 'devel' dataset, a result 8.3% higher than the baseline. We also provide an activation-mapping analysis to identify the frequency ranges that are critical in the 'mask' versus 'clear' classification.
KW - Computational paralinguistics
KW - Convolutional neural networks (CNN)
KW - Ensemble learning
KW - Image-classification
KW - Spectrogram
UR - http://www.scopus.com/inward/record.url?scp=85098160589&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098160589&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2857
DO - 10.21437/Interspeech.2020-2857
M3 - Conference contribution
AN - SCOPUS:85098160589
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2087
EP - 2091
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -
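
Editor's note: the pipeline summarized in the abstract (audio segment -> linear-scale, 3-channel spectrogram image -> pretrained CNN image classifier, with ensemble fusion) can be sketched in Python as below. This is an illustrative sketch only, not the authors' implementation: forming the three channels from three STFT window lengths, the 224x224 image size, the ResNet-50 backbone, and the file name "segment.wav" are all assumptions chosen for the example.

import numpy as np
import librosa
import torch
from torchvision import models

def three_channel_spectrogram(path, sr=16000, n_fft_list=(256, 512, 1024)):
    # Stack three linear-frequency magnitude spectrograms as image channels.
    # Using a different window length per channel is an assumption for illustration.
    y, _ = librosa.load(path, sr=sr)
    channels = []
    for n_fft in n_fft_list:
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 4))
        S_db = librosa.amplitude_to_db(S, ref=np.max)
        # Resize each channel to the fixed input size expected by the CNN.
        img = torch.nn.functional.interpolate(
            torch.tensor(S_db, dtype=torch.float32)[None, None],
            size=(224, 224), mode="bilinear", align_corners=False,
        )[0, 0]
        channels.append(img)
    x = torch.stack(channels)                       # shape (3, 224, 224)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)  # scale to [0, 1]
    return x

# Standard ImageNet backbone with a 2-way 'clear' vs 'mask' head; in practice
# the network would be fine-tuned on the challenge data before use.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.eval()

with torch.no_grad():
    x = three_channel_spectrogram("segment.wav")    # hypothetical file name
    probs = model(x.unsqueeze(0)).softmax(dim=1)
    print(probs)                                    # [P(clear), P(mask)]

The fusion step referred to in the title could then, for example, average the class probabilities of several such models trained with different spectrogram settings or backbones; the abstract does not specify the exact fusion rule.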