Neural Machine Translation for Recovering ASTs from Binaries

Dharma Kc, Tito Ferra, Clayton T. Morrison

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recovering higher-level abstractions of source code from binaries is an important task underlying malware identification, program verification, debugging, program comparison, vulnerability detection, and helping subject matter experts understand compiled code. Existing approaches to extracting higher-level structures from lower-level binary code rely on hand-crafted rules and generally require considerable time and effort from domain experts to design and implement. We present Binary2AST, a framework for generating a structured representation of binary code in the form of an abstract syntax tree (AST) using neural machine translation (NMT). We use the Ghidra binary analysis tool to extract assembly instructions from binaries. A tokenized version of these instructions is then translated by our NMT system into a sequence of symbols that represents an AST. The NMT framework uses deep neural network models, which can require many training examples. To address this, we have developed a C source code generator for a restricted subset of the C language, from which we can sample an arbitrary number of syntactically correct C source code files that in turn can be used to create a parallel data set suitable for NMT training. We evaluate several NMT model variants on their ability to recover AST representations of the original source code from compiled binaries; the best-performing attention-based model achieves a BLEU score of 0.99 on our corpus.
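
As an illustration of the data-generation idea described in the abstract, the following is a minimal sketch, not the authors' generator. It samples a small program from a toy C subset and emits both the C source and a bracketed, pre-order token sequence for its AST, which is one possible form for the target side of a parallel training pair. The grammar, the node names (`Function`, `Assign`, `BinOp`, `Var`, `Const`, `Return`), and the linearization format are all assumptions made for this example. Compiling the source and extracting assembly tokens (for instance with Ghidra) would supply the source side of the pair and is not shown.

```python
import random

# Hypothetical toy grammar for illustration only; not the paper's C subset.
VARS = ["a", "b", "c"]
OPS = ["+", "-", "*"]


def sample_expr(rng, depth=0):
    """Return (c_source, ast) for a random integer expression."""
    if depth >= 2 or rng.random() < 0.4:
        # Leaf: either a variable reference or a small constant.
        if rng.random() < 0.5:
            name = rng.choice(VARS)
            return name, ("Var", name)
        value = str(rng.randint(0, 9))
        return value, ("Const", value)
    op = rng.choice(OPS)
    left_src, left_ast = sample_expr(rng, depth + 1)
    right_src, right_ast = sample_expr(rng, depth + 1)
    return f"({left_src} {op} {right_src})", ("BinOp", op, left_ast, right_ast)


def sample_program(rng, n_stmts=3):
    """Return (c_source, ast) for a small main() with assignments and a return.

    The fixed declaration line is emitted in the source but, to keep the
    sketch short, is not represented in the AST.
    """
    lines = ["int main(void) {", "    int a = 1, b = 2, c = 3;"]
    stmts = []
    for _ in range(n_stmts):
        target = rng.choice(VARS)
        src, ast = sample_expr(rng)
        lines.append(f"    {target} = {src};")
        stmts.append(("Assign", ("Var", target), ast))
    ret = rng.choice(VARS)
    lines.append(f"    return {ret};")
    stmts.append(("Return", ("Var", ret)))
    lines.append("}")
    return "\n".join(lines) + "\n", ("Function", "main", *stmts)


def linearize(node):
    """Flatten a nested-tuple AST into a bracketed token list (pre-order)."""
    if isinstance(node, str):
        return [node]
    tokens = ["(", node[0]]
    for child in node[1:]:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens


rng = random.Random(0)  # fixed seed so the sampled corpus is reproducible
source, ast = sample_program(rng)
target_tokens = linearize(ast)
```

Because every program is built from the grammar, each sampled file is syntactically valid by construction, and the AST is known exactly without parsing. Repeating `sample_program` with different seeds would yield as many source/AST pairs as needed.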

Original language: English (US)
Title of host publication: 2023 3rd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 80-85
Number of pages: 6
ISBN (Electronic): 9798350337952
State: Published - 2023
Event: 3rd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2023 - Xiamen, China
Duration: Jun 16 2023 to Jun 18 2023

Publication series

Name: 2023 3rd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2023

Conference

Conference: 3rd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2023
Country/Territory: China
City: Xiamen
Period: 6/16/23 to 6/18/23

Keywords

  • abstract syntax tree
  • neural machine translation
  • transformer

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Software
  • Safety, Risk, Reliability and Quality
