TY - GEN
T1 - Can Language Models Serve as Text-Based World Simulators?
AU - Wang, Ruoyao
AU - Todd, Graham
AU - Xiao, Ziang
AU - Yuan, Xingdi
AU - Côté, Marc-Alexandre
AU - Clark, Peter
AU - Jansen, Peter
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called BYTESIZED32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes new insights into current LLMs' capabilities and weaknesses, as well as a novel benchmark to track future progress as new models appear.
UR - http://www.scopus.com/inward/record.url?scp=85203787562&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85203787562&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.acl-short.1
DO - 10.18653/v1/2024.acl-short.1
M3 - Conference contribution
AN - SCOPUS:85203787562
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 1
EP - 17
BT - Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Y2 - 11 August 2024 through 16 August 2024
ER -