TY - GEN
T1 - Lightweight TransUNet with Knowledge Distillation for Efficient Medical Image Segmentation
AU - Al Yusuf, Husain
AU - Biradar, Jayant
AU - Lee, Eung Joo
N1 - Publisher Copyright:
© 2025 SPIE. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Computer-assisted interventions and postoperative surgical video analysis have advanced significantly in recent years, contributing to improvements in surgical planning, skill assessment, and training. These advances have transformed the surgical landscape by enabling near real-time segmentation of medical images and providing decision support systems that offer critical guidance and assistance to surgeons of all levels of experience. One of the leading deep neural network models used in medical image analysis is TransUNet, which combines the strengths of the Transformer and U-Net architectures. Leveraging this hybrid architecture, TransUNet has achieved superior performance in a variety of medical segmentation tasks. However, its computational demands, largely inherited from the Transformer, result in high model complexity and low inference efficiency. These challenges limit its deployment in clinical settings that require real-time processing. To address these limitations, we propose an efficient approach that combines knowledge distillation with a modified TransUNet architecture. Specifically, we replace the inherited Multi-Head Self-Attention (MHSA) with a Single-Head Self-Attention (SHSA) mechanism to overcome the quadratic computational complexity of MHSA, and we then train the optimized lightweight TransUNet (student) model to mimic a high-performing TransUNet teacher model through knowledge distillation. This scheme effectively reduces the complexity of the student model while maintaining accurate segmentation results, thus enabling real-time performance in clinical settings.
In our experiments, we evaluated our approach on the Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) benchmark dataset and the Cataract-1K dataset, demonstrating that our distilled model with SHSA achieves an improved trade-off between accuracy and latency, making it more suitable for practical deployment in surgical environments.
KW - Attention Mechanism
KW - Knowledge Distillation
KW - Medical Image Analysis
KW - Real-Time Segmentation
KW - TransUNet
UR - https://www.scopus.com/pages/publications/105010535333
UR - https://www.scopus.com/pages/publications/105010535333#tab=citedBy
U2 - 10.1117/12.3054578
DO - 10.1117/12.3054578
M3 - Conference contribution
AN - SCOPUS:105010535333
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - Real-Time Image Processing and Deep Learning 2025
A2 - Kehtarnavaz, Nasser
A2 - Shirvaikar, Mukul V.
PB - SPIE
T2 - Real-Time Image Processing and Deep Learning 2025
Y2 - 14 April 2025 through 15 April 2025
ER -