TY - JOUR
T1 - Multilingual chief complaint classification for syndromic surveillance
T2 - An experiment with Chinese chief complaints
AU - Lu, Hsin Min
AU - Chen, Hsinchun
AU - Zeng, Daniel
AU - King, Chwan Chuen
AU - Shih, Fuh Yuan
AU - Wu, Tsung Shu
AU - Hsiao, Jin Yi
N1 - Funding Information: This work was supported in part by the U.S. National Science Foundation through Grant #IIS-0428241 (“A National Center of Excellence for Infectious Disease Informatics”). It also draws on earlier work supported by the Arizona Department of Health Services. Chwan-Chuen King acknowledges support from the Taiwan Department of Health (DOH95-DC-1021). Daniel Zeng wishes to acknowledge support from the National Natural Science Foundation of China (60573078 and 60621001), the Chinese Academy of Sciences (2F05N01 and 2F07C01) and the Ministry of Science and Technology (2006AA010106).
PY - 2009/5
Y1 - 2009/5
AB - Purpose: Syndromic surveillance is aimed at early detection of disease outbreaks. An important data source for syndromic surveillance is free-text chief complaints (CCs), which may be recorded in different languages. For automated syndromic surveillance, CCs must be classified into predefined syndromic categories to facilitate subsequent data aggregation and analysis. Despite the fact that syndromic surveillance is largely an international effort, existing CC classification systems do not provide adequate support for processing CCs recorded in non-English languages. This paper reports a multilingual CC classification effort, focusing on CCs recorded in Chinese. Methods: We propose a novel Chinese CC classification system leveraging a Chinese-English translation module and an existing English CC classification approach. A set of 470 Chinese key phrases was extracted from about one million Chinese CC records using statistical methods. Based on the extracted key phrases, the system translates Chinese text into English and classifies the translated CCs to syndromic categories using an existing English CC classification system. Results: Compared to alternative approaches using a bilingual dictionary and a general-purpose machine translation system, our approach performs significantly better in terms of positive predictive value (PPV or precision), sensitivity (recall), specificity, and F measure (the harmonic mean of PPV and sensitivity), based on a computational experiment using real-world CC records. Conclusions: Our design provides satisfactory performance in classifying Chinese CCs into syndromic categories for public health surveillance. The overall design of our system also points out a potentially fruitful direction for multilingual CC systems that need to handle languages beyond English and Chinese.
KW - Communicable disease control
KW - Medical records
KW - Multilingual chief complaint classification
KW - Statistical pattern extraction
KW - Syndromic surveillance
UR - http://www.scopus.com/inward/record.url?scp=62849086564&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=62849086564&partnerID=8YFLogxK
U2 - 10.1016/j.ijmedinf.2008.08.004
DO - 10.1016/j.ijmedinf.2008.08.004
M3 - Article
C2 - 18838292
AN - SCOPUS:62849086564
SN - 1386-5056
VL - 78
SP - 308
EP - 320
JO - International Journal of Medical Informatics
JF - International Journal of Medical Informatics
IS - 5
ER -