TY - GEN
T1 - Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy
T2 - 2024 Findings of the Association for Computational Linguistics, EMNLP 2024
AU - Dumitru, Razvan Gabriel
AU - Clotan, Paul Ioan
AU - Yadav, Vikas
AU - Peteleaza, Darius
AU - Surdeanu, Mihai
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - This paper introduces a novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), enhancing the traditional methodology established by SliceGPT. By transitioning from constant to dynamic slicing, our method leverages the newly proposed Layer Redundancy (LR) score, which assesses how much each layer changes its input by measuring the cosine similarity between the input and the output of the layer. We use this score to prune parts of individual layers based on redundancy in such a way that the average pruned percentage across all layers is a fixed value. We conducted extensive experiments using models such as Llama3-8B and Mistral-7B on multiple datasets, evaluating different slicing bases and percentages to determine optimal configurations that balance efficiency and performance. Our findings show that our dynamic slicing approach not only maintains but, in many cases, enhances model performance compared to the baseline established by constant slicing methods. For instance, in several settings, we see performance improvements of up to 5% over the SliceGPT baseline. Additionally, a perplexity decrease of as much as 7% was observed across multiple benchmarks, validating the effectiveness of our method. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/DynamicSlicing.
AB - This paper introduces a novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), enhancing the traditional methodology established by SliceGPT. By transitioning from constant to dynamic slicing, our method leverages the newly proposed Layer Redundancy (LR) score, which assesses how much each layer changes its input by measuring the cosine similarity between the input and the output of the layer. We use this score to prune parts of individual layers based on redundancy in such a way that the average pruned percentage across all layers is a fixed value. We conducted extensive experiments using models such as Llama3-8B and Mistral-7B on multiple datasets, evaluating different slicing bases and percentages to determine optimal configurations that balance efficiency and performance. Our findings show that our dynamic slicing approach not only maintains but, in many cases, enhances model performance compared to the baseline established by constant slicing methods. For instance, in several settings, we see performance improvements of up to 5% over the SliceGPT baseline. Additionally, a perplexity decrease of as much as 7% was observed across multiple benchmarks, validating the effectiveness of our method. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/DynamicSlicing.
UR - http://www.scopus.com/inward/record.url?scp=85217620159&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217620159&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.findings-emnlp.579
DO - 10.18653/v1/2024.findings-emnlp.579
M3 - Conference contribution
AN - SCOPUS:85217620159
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
SP - 9912
EP - 9920
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
Y2 - 12 November 2024 through 16 November 2024
ER -
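
Note (not part of the RIS record above): the abstract describes the Layer Redundancy (LR) score as the cosine similarity between a layer's input and its output, with per-layer pruning chosen so that the average pruned percentage stays fixed. The sketch below is a minimal illustration of that idea, assuming PyTorch tensors for hidden states; the function names lr_score and allocate_slicing, and the proportional allocation rule, are illustrative assumptions rather than the paper's exact implementation (see the linked repository for the authors' code).

import torch
import torch.nn.functional as F

def lr_score(layer_input: torch.Tensor, layer_output: torch.Tensor) -> torch.Tensor:
    # LR score as described in the abstract: cosine similarity between a layer's
    # input and its output hidden states. Higher similarity means the layer
    # changes its input less, i.e. the layer is more redundant.
    x = layer_input.reshape(-1, layer_input.shape[-1])    # (tokens, hidden)
    y = layer_output.reshape(-1, layer_output.shape[-1])  # (tokens, hidden)
    return F.cosine_similarity(x, y, dim=-1).mean()

def allocate_slicing(lr_scores: torch.Tensor, target_mean: float) -> torch.Tensor:
    # Hypothetical allocation rule (assumption): prune each layer in proportion
    # to its redundancy score, rescaled so the mean pruned fraction equals
    # target_mean; the abstract only fixes the average, not this exact rule.
    ratios = lr_scores / lr_scores.mean() * target_mean
    return ratios.clamp(0.0, 1.0)  # clamping extreme layers may shift the mean slightly

For example, LR scores collected over a calibration set could be passed as allocate_slicing(lr_scores, 0.25) to target an average slicing of 25% while pruning more from redundant layers and less from layers that transform their input heavily.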