TY - GEN
T1 - BYTESIZED32
T2 - 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
AU - Wang, Ruoyao
AU - Todd, Graham
AU - Yuan, Xingdi
AU - Xiao, Ziang
AU - Côté, Marc-Alexandre
AU - Jansen, Peter
N1 - Publisher Copyright:
©2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - In this work we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of PYTHON code. To facilitate this task, we introduce BYTESIZED32, a corpus of 32 reasoning-focused text games totalling 20k lines of PYTHON code. We empirically demonstrate that GPT-4 can use these games as templates for single-shot in-context learning, successfully producing runnable games on unseen topics in 28% of cases. When allowed to self-reflect on program errors, game runnability substantially increases to 57%. While evaluating simulation fidelity is labor intensive, we introduce a suite of automated metrics to assess game fidelity, technical validity, adherence to task specifications, and winnability, showing a high degree of agreement with expert human ratings. We pose this as a challenge task to spur further development at the juncture of world modeling and code generation.
AB - In this work we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of PYTHON code. To facilitate this task, we introduce BYTESIZED32, a corpus of 32 reasoning-focused text games totalling 20k lines of PYTHON code. We empirically demonstrate that GPT-4 can use these games as templates for single-shot in-context learning, successfully producing runnable games on unseen topics in 28% of cases. When allowed to self-reflect on program errors, game runnability substantially increases to 57%. While evaluating simulation fidelity is labor intensive, we introduce a suite of automated metrics to assess game fidelity, technical validity, adherence to task specifications, and winnability, showing a high degree of agreement with expert human ratings. We pose this as a challenge task to spur further development at the juncture of world modeling and code generation.
UR - http://www.scopus.com/inward/record.url?scp=85184822199&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184822199&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85184822199
T3 - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 13455
EP - 13471
BT - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
A2 - Bouamor, Houda
A2 - Pino, Juan
A2 - Bali, Kalika
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023 through 10 December 2023
ER -