TY - GEN
T1 - Empowering Persian LLMs for Instruction Following
T2 - 1st Workshop on Language Models for Low-Resource Languages, LoResLM 2025 - co-located at the 31st International Conference on Computational Linguistics, COLING 2025
AU - Mokhtarabadi, Hojjat
AU - Zamani, Ziba
AU - Maazallahi, Abbas
AU - Manshaei, Mohammad Hossein
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Instruction-tuned large language models have demonstrated remarkable capabilities in following human instructions across various domains. However, their proficiency remains notably deficient in many low-resource languages. To address this challenge, we begin by introducing FarsInstruct: a comprehensive instruction dataset designed to enhance the instruction-following ability of large language models specifically for Persian, a significant yet globally underrepresented language. FarsInstruct encompasses a wide range of task types and datasets, each containing a mix of manually written instructions, ranging from straightforward to complex, as well as translations from the Public Pool of Prompts, ensuring rich linguistic and cultural representation. Furthermore, we introduce Co-CoLA, a framework designed to enhance the multi-task adaptability of LoRA-tuned models. Through extensive experimental analyses, our study showcases the effectiveness of the FarsInstruct dataset, coupled with training under the Co-CoLA framework, in improving the performance of large language models in the Persian context. As of this writing, FarsInstruct comprises 197 templates across 21 distinct datasets, and we intend to update it consistently, thereby broadening its applicability.
AB - Instruction-tuned large language models have demonstrated remarkable capabilities in following human instructions across various domains. However, their proficiency remains notably deficient in many low-resource languages. To address this challenge, we begin by introducing FarsInstruct: a comprehensive instruction dataset designed to enhance the instruction-following ability of large language models specifically for Persian, a significant yet globally underrepresented language. FarsInstruct encompasses a wide range of task types and datasets, each containing a mix of manually written instructions, ranging from straightforward to complex, as well as translations from the Public Pool of Prompts, ensuring rich linguistic and cultural representation. Furthermore, we introduce Co-CoLA, a framework designed to enhance the multi-task adaptability of LoRA-tuned models. Through extensive experimental analyses, our study showcases the effectiveness of the FarsInstruct dataset, coupled with training under the Co-CoLA framework, in improving the performance of large language models in the Persian context. As of this writing, FarsInstruct comprises 197 templates across 21 distinct datasets, and we intend to update it consistently, thereby broadening its applicability.
KW - Instruction-tuned LLMs
KW - Low-resource languages
KW - Parameter efficient finetuning
UR - https://www.scopus.com/pages/publications/105000148712
UR - https://www.scopus.com/pages/publications/105000148712#tab=citedBy
M3 - Conference contribution
AN - SCOPUS:105000148712
T3 - Proceedings - International Conference on Computational Linguistics, COLING
SP - 31
EP - 67
BT - LoResLM 2025 - 1st Workshop on Language Models for Low-Resource Languages, Proceedings of the Workshop
A2 - Hettiarachchi, Hansi
A2 - Ranasinghe, Tharindu
A2 - Rayson, Paul
A2 - Mitkov, Ruslan
A2 - Gaber, Mohamed
A2 - Premasiri, Damith
A2 - Tan, Fiona Anting
A2 - Uyangodage, Lasitha
PB - Association for Computational Linguistics (ACL)
Y2 - 20 January 2025
ER -