TY - GEN
T1 - Neural Machine Translation for Recovering ASTs from Binaries
AU - Kc, Dharma
AU - Ferra, Tito
AU - Morrison, Clayton T.
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Recovering higher-level abstractions of source code from binaries is an important task underlying malware identification, program verification, debugging, program comparison, vulnerability detection, and helping subject matter experts understand compiled code. Existing approaches to extracting higher-level structures from lower-level binary code rely on hand-crafted rules and generally require great time and effort of domain experts to design and implement. We present Binary2AST, a framework for generating a structured representation of binary code in the form of an abstract syntax tree (AST) using neural machine translation (NMT). We use the Ghidra binary analysis tool to extract assembly instructions from binaries. A tokenized version of these instructions is then translated by our NMT system into a sequence of symbols that represent an AST. The NMT framework uses deep neural network models that can require a lot of training examples. To address this, we have developed a C source code generator for a restricted subset of the C language, from which we can sample an arbitrary number of syntactically correct C source code files that in turn can be used to create a parallel data set suitable for NMT training. We evaluate several variant NMT models on their ability to recover AST representations of the original source code from compiled binaries, where the best-performing attention-based model achieves a BLEU score of 0.99 on our corpus.
AB - Recovering higher-level abstractions of source code from binaries is an important task underlying malware identification, program verification, debugging, program comparison, vulnerability detection, and helping subject matter experts understand compiled code. Existing approaches to extracting higher-level structures from lower-level binary code rely on hand-crafted rules and generally require great time and effort of domain experts to design and implement. We present Binary2AST, a framework for generating a structured representation of binary code in the form of an abstract syntax tree (AST) using neural machine translation (NMT). We use the Ghidra binary analysis tool to extract assembly instructions from binaries. A tokenized version of these instructions is then translated by our NMT system into a sequence of symbols that represent an AST. The NMT framework uses deep neural network models that can require a lot of training examples. To address this, we have developed a C source code generator for a restricted subset of the C language, from which we can sample an arbitrary number of syntactically correct C source code files that in turn can be used to create a parallel data set suitable for NMT training. We evaluate several variant NMT models on their ability to recover AST representations of the original source code from compiled binaries, where the best-performing attention-based model achieves a BLEU score of 0.99 on our corpus.
KW - abstract syntax tree
KW - neural machine translation
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85171551820&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171551820&partnerID=8YFLogxK
U2 - 10.1109/SEAI59139.2023.10217602
DO - 10.1109/SEAI59139.2023.10217602
M3 - Conference contribution
AN - SCOPUS:85171551820
T3 - 2023 3rd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2023
SP - 80
EP - 85
BT - 2023 3rd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2023
Y2 - 16 June 2023 through 18 June 2023
ER -