TY - JOUR
T1 - Increasing the intelligibility and naturalness of alaryngeal speech using voice conversion and synthetic fundamental frequency
AU - Dinh, Tuan
AU - Kain, Alexander
AU - Samlan, Robin
AU - Cao, Beiming
AU - Wang, Jun
N1 - Funding Information:
This material is based upon work supported by the National Institutes of Health under grants R01DC016621 and R03DC013990.
Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - Individuals who undergo a laryngectomy lose their ability to phonate. Although current treatment options allow alaryngeal speech, these individuals still struggle in their daily communication and social life due to the low intelligibility of their speech. In this paper, we presented two conversion methods for increasing the intelligibility and naturalness of speech produced by laryngectomees (LAR). The first method used a deep neural network for predicting binary voicing/unvoicing or the degree of aperiodicity. The second method used a conditional generative adversarial network to learn the mapping from LAR speech spectra to clearly articulated speech spectra. We also created a synthetic fundamental frequency trajectory with an intonation model consisting of phrase and accent curves. For the two conversion methods, we showed that adaptation always increased the performance of pre-trained models in objective evaluation. In subjective testing involving four LAR speakers, we significantly improved the naturalness of two speakers, and we also significantly improved the intelligibility of one speaker.
AB - Individuals who undergo a laryngectomy lose their ability to phonate. Although current treatment options allow alaryngeal speech, these individuals still struggle in their daily communication and social life due to the low intelligibility of their speech. In this paper, we presented two conversion methods for increasing the intelligibility and naturalness of speech produced by laryngectomees (LAR). The first method used a deep neural network for predicting binary voicing/unvoicing or the degree of aperiodicity. The second method used a conditional generative adversarial network to learn the mapping from LAR speech spectra to clearly articulated speech spectra. We also created a synthetic fundamental frequency trajectory with an intonation model consisting of phrase and accent curves. For the two conversion methods, we showed that adaptation always increased the performance of pre-trained models in objective evaluation. In subjective testing involving four LAR speakers, we significantly improved the naturalness of two speakers, and we also significantly improved the intelligibility of one speaker.
KW - Speech intelligibility
KW - Total laryngectomy
KW - Voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85098214365&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098214365&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1196
DO - 10.21437/Interspeech.2020-1196
M3 - Conference article
AN - SCOPUS:85098214365
SN - 2308-457X
VL - 2020-October
SP - 4781
EP - 4785
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -