Abstract:
Many binary code analysis tools do not work directly with machine instructions; instead, they rely on an intermediate representation (IR) derived from the binary code. In this paper, we first analyze the binary code analysis problems that benefit from such an IR and derive a list of requirements that an IR suitable for solving these problems must meet. A universal binary analysis platform requires two principal components. The first is a retargetable instruction decoder that uses external specifications to describe target instruction sets. External specifications improve maintainability and make it quick to add support for new instruction sets. We analyze some of the more common instruction set architectures (ISAs), including those used in microcontrollers, and from this analysis produce a list of requirements for a retargetable decoder. We then survey existing multi-ISA decoders and propose a more generic approach, based on a multi-layered directed acyclic graph that describes the decoding process in universal terms. The second component is the architecture-neutral IR itself. We describe existing IRs of this kind and propose Pivot 2, an IR that is low-level enough to be easily constructed from decoded machine instructions yet is also easy to analyze. The main features of Pivot 2 are explicit side effects, static single assignment (SSA) variables, a simpler alternative to phi-functions, and an extensible set of elementary operations at its core. The IR also supports machines with multiple memory address spaces. Finally, we propose a way to tie the decoder and the IR together and adapt them to most binary code analysis tasks through abstract interpretation on top of the IR.
The proposed scheme takes into account various aspects of target architectures that are overlooked in many other works, including pipeline specifics (handling of delay slots, hardware loop support, etc.), exception and interrupt management, and a generic address space model where accesses may have arbitrary side effects due to memory-mapped devices or other non-trivial behavior of the memory system.