TY - GEN
T1 - ICD Codes are Insufficient to Create Datasets for Machine Learning
T2 - 12th IEEE International Conference on Healthcare Informatics, ICHI 2024
AU - Whitlock, Abigail E.
AU - Leroy, Gondy
AU - Donovan, Fariba M.
AU - Galgiani, John N.
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In medicine, machine learning (ML) datasets are often built using International Classification of Diseases (ICD) codes. As new models are being developed, there is a need for larger datasets. However, ICD codes are intended for billing. We aim to determine how suitable ICD codes are for creating datasets to train ML models. We focused on one rare and one common disease using the All of Us database. First, we compared the patient cohort created using ICD codes for Valley fever (coccidioidomycosis, CM) with the cohort identified via serological confirmation. Second, we compared two similarly created patient cohorts for myocardial infarction (MI). For both diseases, we identified significant discrepancies between the ICD-based and laboratory-confirmed groups, and the patient overlap was small. The CM cohort had 811 patients in the ICD-10 group, 619 patients in the positive-serology group, and 24 in both. The MI cohort had 14,875 patients in the ICD-10 group, 23,598 in the laboratory-confirmed group, and 6,531 in both. Demographics, rates of disease symptoms, and other clinical data varied across our case study cohorts.
KW - ICD Codes
KW - Myocardial Infarction
KW - Valley fever
UR - http://www.scopus.com/inward/record.url?scp=85203679681&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85203679681&partnerID=8YFLogxK
U2 - 10.1109/ICHI61247.2024.00024
DO - 10.1109/ICHI61247.2024.00024
M3 - Conference contribution
AN - SCOPUS:85203679681
T3 - Proceedings - 2024 IEEE 12th International Conference on Healthcare Informatics, ICHI 2024
SP - 129
EP - 134
BT - Proceedings - 2024 IEEE 12th International Conference on Healthcare Informatics, ICHI 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 June 2024 through 6 June 2024
ER -