TY - GEN
T1 - Multi-disease Classification of CT Reports using Traditional Natural Language Processing and a Lightweight Foundation Model
AU - Garcia-Alcoser, Michael E.
AU - Tushar, Fakrul Islam
AU - Nejad, Mobina Ghojogh
AU - Rubin, Geoffrey D.
AU - Lo, Joseph Y.
N1 - Publisher Copyright: © 2025 SPIE
PY - 2025
Y1 - 2025
N2 - Natural language processing (NLP) methods can annotate free-text radiology reports to create large datasets at the scale of an entire health system or beyond. Generalizing the disease classification across multiple organ systems inherently requires a complex, robust, and accurate classification model. Concurrently, NLP methods have significantly improved and become more sophisticated. This study compares two traditional NLP methods, a rule-based algorithm (RBA) and a Bidirectional Long Short-Term Memory network (BiLSTM), with a lightweight variant of the Large Language Model Meta AI (Llama) model. Our goal is to analyze the capabilities and limitations of each model in accurately classifying diseases encountered within the chest, abdominal, and pelvic computed tomography (CT) exams of the body. Rule-based algorithms (RBAs) were used to extract disease labels from the “findings” section of CT radiology reports, creating the training, validation, and testing datasets. Disease labels were made for three organ systems: the lungs/pleura, liver/gallbladder, and kidneys/ureters. A BiLSTM network with an attention mechanism was trained on 151,431 cases and tested on 85,987 cases. The BiLSTM and Meta's Llama3.1-8B model were evaluated on the RBA-test set and a manually annotated dataset. On the smaller, manually labeled test set, the RBA model achieved the highest macro F1 score (0.94), followed by the BiLSTM (0.91) and then Llama (0.89). In contrast, on the larger RBA-labeled test set, the BiLSTM maintained high performance (average AUC > 0.98; macro F1 = 0.95), while Llama's macro F1 dropped to 0.65. Manual spot checking of reports where Llama disagreed with RBA/BiLSTM revealed numerous instances in which Llama was actually correct, indicating flaws with the previous RBA labeling. This study emphasizes the limitations of rule-based approaches and the need to consider clinical context in ambiguous scenarios. Llama3.1-8B exhibits the potential to outperform rule-based methods, indicating promise for reliable, large-scale multi-disease classification in CT text reports.
AB - Natural language processing (NLP) methods can annotate free-text radiology reports to create large datasets at the scale of an entire health system or beyond. Generalizing the disease classification across multiple organ systems inherently requires a complex, robust, and accurate classification model. Concurrently, NLP methods have significantly improved and become more sophisticated. This study compares two traditional NLP methods, a rule-based algorithm (RBA) and a Bidirectional Long Short-Term Memory network (BiLSTM), with a lightweight variant of the Large Language Model Meta AI (Llama) model. Our goal is to analyze the capabilities and limitations of each model in accurately classifying diseases encountered within the chest, abdominal, and pelvic computed tomography (CT) exams of the body. Rule-based algorithms (RBAs) were used to extract disease labels from the “findings” section of CT radiology reports, creating the training, validation, and testing datasets. Disease labels were made for three organ systems: the lungs/pleura, liver/gallbladder, and kidneys/ureters. A BiLSTM network with an attention mechanism was trained on 151,431 cases and tested on 85,987 cases. The BiLSTM and Meta's Llama3.1-8B model were evaluated on the RBA-test set and a manually annotated dataset. On the smaller, manually labeled test set, the RBA model achieved the highest macro F1 score (0.94), followed by the BiLSTM (0.91) and then Llama (0.89). In contrast, on the larger RBA-labeled test set, the BiLSTM maintained high performance (average AUC > 0.98; macro F1 = 0.95), while Llama's macro F1 dropped to 0.65. Manual spot checking of reports where Llama disagreed with RBA/BiLSTM revealed numerous instances in which Llama was actually correct, indicating flaws with the previous RBA labeling. This study emphasizes the limitations of rule-based approaches and the need to consider clinical context in ambiguous scenarios. Llama3.1-8B exhibits the potential to outperform rule-based methods, indicating promise for reliable, large-scale multi-disease classification in CT text reports.
KW - BiLSTM
KW - Computed Tomography
KW - Foundation Model
KW - Llama
KW - LLM
KW - Natural Language Processing
KW - Rule-based algorithm
UR - https://www.scopus.com/pages/publications/105004740409
UR - https://www.scopus.com/inward/citedby.url?scp=105004740409&partnerID=8YFLogxK
U2 - 10.1117/12.3047690
DO - 10.1117/12.3047690
M3 - Conference contribution
AN - SCOPUS:105004740409
T3 - Progress in Biomedical Optics and Imaging - Proceedings of SPIE
BT - Medical Imaging 2025
A2 - Wu, Shandong
PB - SPIE
T2 - Medical Imaging 2025: Imaging Informatics
Y2 - 17 February 2025 through 19 February 2025
ER -