Expressions and Expression Trees
One of the primary differences between assembly language (regardless of the specific platform) and high-level languages is the ability of high-level languages to describe complex expressions. Consider the following C statement for instance.
a = x * 2 + y / (z + 4);
In C this is considered a single statement, but when the compiler translates the program to assembly language, it is forced to break it down into quite a few assembly language instructions. One of the most important aspects of the decompilation process is the reconstruction of meaningful expressions from these individual instructions. For this, the decompiler’s intermediate representation needs to be able to represent expressions of varying complexity. This is implemented using expression trees similar to the ones used by compilers. Figure 13.1 illustrates an expression tree that describes the above expression.
          mov
         /   \
        a     add
             /   \
          mul     div
         /   \   /   \
        x     2 y     add
                     /   \
                    z     4

Figure 13.1 An expression tree representing the above C high-level expression. The operators are expressed using their IA-32 instruction names to illustrate how such an expression is translated from a machine code representation to an expression tree.
The idea with this kind of tree is that it is an elegant structured representation of a sequence of arithmetic instructions. Each branch in the tree is roughly equivalent to an instruction in the decompiled program. It is up to the decompiler to perform data-flow analysis on these instructions and construct such a tree. Once a tree is constructed, it becomes fairly trivial to produce high-level language expressions by simply scanning the tree. The process of constructing expression trees from individual instructions is discussed below in the data-flow analysis section.
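To make this concrete, here is a minimal sketch in C of how such a tree might be represented and then flattened back into source text. The type and helper names (ExprNode, emit, and so on) are hypothetical and are not taken from any real decompiler.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical expression-tree node; real decompilers attach much
   more information (types, sizes, operand locations) to each node. */
typedef enum { OP_MOV, OP_ADD, OP_MUL, OP_DIV, OP_CONST, OP_VAR } OpKind;

typedef struct ExprNode {
    OpKind kind;
    long value;               /* used when kind == OP_CONST */
    const char *name;         /* used when kind == OP_VAR   */
    struct ExprNode *left;    /* operands; NULL for leaves  */
    struct ExprNode *right;
} ExprNode;

static ExprNode *node(OpKind k, ExprNode *l, ExprNode *r)
{
    ExprNode *n = calloc(1, sizeof *n);
    n->kind = k; n->left = l; n->right = r;
    return n;
}
static ExprNode *var(const char *s) { ExprNode *n = node(OP_VAR, NULL, NULL); n->name = s; return n; }
static ExprNode *num(long v)        { ExprNode *n = node(OP_CONST, NULL, NULL); n->value = v; return n; }

/* Produce C source from the tree with a simple in-order walk. */
static void emit(const ExprNode *n)
{
    static const char *sym[] = { " = ", " + ", " * ", " / " };
    if (n->kind == OP_CONST) { printf("%ld", n->value); return; }
    if (n->kind == OP_VAR)   { printf("%s", n->name);   return; }
    if (n->kind != OP_MOV) printf("(");
    emit(n->left); printf("%s", sym[n->kind]); emit(n->right);
    if (n->kind != OP_MOV) printf(")");
}

int main(void)
{
    /* The tree from Figure 13.1: a = x * 2 + y / (z + 4) */
    ExprNode *t = node(OP_MOV, var("a"),
                       node(OP_ADD,
                            node(OP_MUL, var("x"), num(2)),
                            node(OP_DIV, var("y"),
                                 node(OP_ADD, var("z"), num(4)))));
    emit(t);
    printf(";\n");    /* prints: a = ((x * 2) + (y / (z + 4))); */
    return 0;
}

The redundant parentheses are a side effect of parenthesizing every interior node; a real back end would consult operator precedence to decide when parentheses are actually needed.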
Control Flow Graphs
In order to reconstruct high-level control flow information from a low-level representation of a program, decompilers must create a control flow graph (CFG) for each procedure being analyzed. A CFG is a graph representation of the internal flow within a single procedure. The idea with control flow graphs is that they can easily be converted to high-level language control flow constructs such as loops and the various types of branches. Figure 13.2 shows three typical control flow graph structures for an if statement, an if-else statement, and a while loop.
Figure 13.2 Typical control flow graphs: (a) a simple if statement, (b) an if-else statement, (c) a while loop.
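For reference, the three graph shapes in Figure 13.2 correspond to C code shaped like the following sketch (the function and variables are invented for illustration only):

/* The three control flow shapes from Figure 13.2, as C code. */
void shapes(int a, int b)
{
    if (a)            /* (a): one conditional edge around a block */
        b++;

    if (a)            /* (b): two alternative blocks that merge   */
        b++;
    else
        b--;

    while (a > 0)     /* (c): a back edge to the loop condition   */
        a--;
}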
The Front End
Decompiler front ends perform the opposite function of compiler back ends. Compiler back ends take a compiler’s intermediate representation and convert it to the target machine’s native assembly language, whereas decompiler front ends take the same native assembly language and convert it back into the decompiler’s intermediate representation. The first step in this process is to go over the source executable byte by byte and decode each instruction, including its operands. The decoded instructions are then analyzed and converted into the decompiler’s intermediate representation. This intermediate representation is then gradually refined during the code analysis stage to prepare it for conversion into a high-level language representation by the back end.
Some decompilers don’t actually go through the process of disassembling the source executable but instead require the user to run it through a disassembler (such as IDA Pro). The disassembler produces a textual representation of the source program, which can then be read and analyzed by the decompiler. This does not directly affect the results of the decompilation process but merely creates a minor inconvenience for the user.
The following sections discuss the individual stages that take place inside a decompiler’s front end.
Semantic Analysis
A decompiler front end starts out by simply scanning the individual instructions and converting them into the decompiler’s intermediate representation, but it doesn’t end there. Directly translating individual instructions often has little value in itself, because some of these instructions only make sense together, as a sequence. There are many architecture-specific sequences that exist to overcome certain limitations of the specific architecture. The front end must properly resolve these types of sequences and correctly translate them into the intermediate representation, while eliminating all of the architecture-specific details.
Let’s take a look at an example of such a sequence. In the early days of the IA-32 architecture, the floating-point unit was not an integral part of the processor, and was actually implemented on a separate chip (typically referred to as the math coprocessor) that had its own socket on the motherboard. This meant that the two instruction sets were highly isolated from one another, which imposed some limitations. For example, to compare two floating-point values, one couldn’t just compare and conditionally branch using the standard conditional branch instructions. The problem was that the math coprocessor
couldn’t directly update the EFLAGS register (nowadays this is easy, because the two units are implemented on a single chip). This meant that the result of a floating-point comparison was written into a separate floating-point status register, which then had to be loaded into one of the general-purpose registers, and from there it was possible to test its value and perform a conditional branch. Let’s look at an example.
00401000  FLD   DWORD PTR [ESP+4]
00401004  FCOMP DWORD PTR [ESP+8]
00401008  FSTSW AX
0040100A  TEST  AH,41
0040100D  JNZ   SHORT 0040101D
This snippet loads one floating-point value into the floating-point stack (essentially like a floating-point register), and compares another value against the first value. Because the older FCOMP instruction is used, the result is stored in the floating-point status word. If the code were to use the newer FCOMIP instruction, the outcome would be written directly into EFLAGS, but this is a newer instruction that didn’t exist in older versions of the processor. Because the result is stored in the floating-point status word, you need to somehow get it out of there in order to test the result of the comparison and perform a conditional branch. This is done using the FSTSW instruction, which copies the floating-point status word into the AX register. Once that value is in AX, you can test the specific flags and perform the conditional branch.
The bottom line is that to translate this sequence into the decompiler’s intermediate representation (which is not supposed to contain any architecture-specific details), the front end must “understand” the sequence for what it is, and eliminate the code that tests for specific flags (the constant 0x41) and so on. This is usually implemented by adding specific code in the front end that knows how to decipher these types of sequences.
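As a rough illustration, here is what a front end should ultimately make of the five-instruction sequence above. This is a hypothetical sketch: the function and parameter names are invented, and the two parameters stand for the values at [ESP+4] and [ESP+8]. TEST AH,41 isolates the C0 (“less than”) and C3 (“equal”) bits of the floating-point status word, so (ignoring the unordered NaN case, which sets additional status bits) the JNZ is taken exactly when the first value is less than or equal to the second.

#include <stdio.h>

/* A minimal sketch of the high-level meaning of the
   FLD/FCOMP/FSTSW/TEST/JNZ sequence: a single floating-point
   comparison followed by a conditional branch.               */
static int branch_taken(float a, float b)   /* a = [ESP+4], b = [ESP+8] */
{
    if (a <= b)       /* TEST AH,41 / JNZ: C0 or C3 set */
        return 1;     /* jump to 0040101D is taken      */
    return 0;         /* fall-through case              */
}

int main(void)
{
    printf("%d\n", branch_taken(1.0f, 2.0f));  /* prints 1 */
    printf("%d\n", branch_taken(3.0f, 2.0f));  /* prints 0 */
    return 0;
}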
Generating Control Flow Graphs
The code generated by a decompiler’s front end is represented in a graph structure, where each code block is called a basic block (BB). This graph structure simply represents the control flow instructions present in the low-level machine code. Each BB ends with a control flow instruction such as a branch instruction, a call, or a ret, or with a label that is referenced by some branch instruction elsewhere in the code (because labels represent a control flow join).
Blocks are defined for each code segment that is referenced elsewhere in the code, typically by a branch instruction. Additionally, a BB is created after every conditional branch instruction, so that a conditional branch instruction
can either flow into the BB representing the branch target address or into the BB that contains the code immediately following the condition. This concept is illustrated in Figure 13.3. Note that to improve readability the actual code in Figure 13.3 is shown as IA-32 assembly language code, whereas in most decompilers BBs are represented using the decompiler’s internal instruction set.
00401064  PUSH EAX
00401065  PUSH 1008
0040106A  PUSH cryptex.00405050
0040106F  PUSH ESI
00401070  CALL [<&KERNEL32.ReadFile>]
00401076  TEST EAX,EAX
00401078  JE SHORT cryptex.004010CB

0040107A  MOV EAX,[ESP+18]
0040107E  TEST EAX,EAX
00401080  MOV DWORD PTR [ESP+14],1008
00401088  JE SHORT cryptex.004010C2

0040108A  LEA ECX,[ESP+14]
0040108E  PUSH ECX
0040108F  PUSH cryptex.00405050
00401094  PUSH 0
00401096  PUSH 1
00401098  PUSH 0
0040109A  PUSH EAX
0040109B  CALL [<&ADVAPI32.CryptDecrypt>]
004010A1  TEST EAX,EAX
004010A3  JNZ SHORT cryptex.004010C2

004010A5  CALL [<&KERNEL32.GetLastError>]
004010AB  PUSH EDI
004010AC  PUSH cryptex.004030E8
004010B1  CALL [<&MSVCR71.printf>]
004010B7  ADD ESP,8
004010BA  PUSH 1
004010BC  CALL [<&MSVCR71.exit>]

004010C2  POP EDI
004010C3  MOV EAX,cryptex.00405050
004010C8  POP ESI
004010C9  POP ECX
004010CA  RETN

004010CB  POP EDI
004010CC  XOR EAX,EAX
004010CE  POP ESI
004010CF  POP ECX
004010D0  RETN

Figure 13.3 An unstructured control flow graph representing branches in the original program, reproduced above as a flat listing of its six basic blocks (separated by blank lines). In the original figure, dotted arrows represent conditional branch instructions while plain ones represent fall-through cases, where execution proceeds when a branch isn’t taken.
The control flow graph in Figure 13.3 is quite primitive. It is essentially a graphical representation of the low-level control flow statements in the program. It is important to perform this simple analysis at this early stage in decompilation to correctly break the program into basic blocks. The process of actually structuring these graphs into a representation closer to the one used by high-level languages is performed later, during the control flow analysis stage.
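As a sketch of how this early partitioning can be implemented, consider the following C fragment. The IR types and field names (Instr, InstrKind, is_leader) are hypothetical. The rules follow the text directly: the first instruction starts a block, every branch target starts a block, and the instruction after any control flow instruction starts a block, so that a conditional branch can fall through into a block of its own.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical low-level IR instruction. */
typedef enum { I_OTHER, I_BRANCH, I_COND_BRANCH, I_CALL, I_RET } InstrKind;

typedef struct {
    InstrKind kind;
    size_t target;     /* index of the branch target, if any  */
    bool is_leader;    /* set by mark_leaders: starts a new BB */
} Instr;

/* Mark every instruction that begins a basic block; the blocks
   themselves are then simply the runs between leaders.         */
void mark_leaders(Instr *code, size_t n)
{
    if (n == 0) return;
    code[0].is_leader = true;
    for (size_t i = 0; i < n; i++) {
        InstrKind k = code[i].kind;
        if (k == I_BRANCH || k == I_COND_BRANCH)
            code[code[i].target].is_leader = true;  /* branch target */
        if (k != I_OTHER && i + 1 < n)
            code[i + 1].is_leader = true;           /* fall-through  */
    }
}

Once the leaders are known, each basic block runs from one leader up to (but not including) the next, and the edges of the control flow graph follow directly from each block’s final instruction.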
Code Analysis
Strictly speaking, a decompiler doesn’t have an optimizing stage. After all, you’re looking to produce a high-level language representation from a binary executable, and not to “improve” the program in any way. On the contrary, you want the output to match the original program as closely as possible. In reality, this optimizing, or code-improving, phase in a decompiler is where the program is transformed from a low-level intermediate representation to a higher-level intermediate representation that is ready to be transformed into high-level language code. This process could actually be described as the opposite of the compiler’s optimization process: you’re trying to undo many of the compiler’s optimizations.
The code analysis stage is where much of the interesting stuff happens. Decompilation literature is quite scarce, and there doesn’t seem to be an official term for this stage, so I’ll just name it the code analysis stage, even though some decompiler researchers simply call it the middle-end.
The code analysis stage starts with an intermediate representation of the program that is fairly close to the original assembly language code. The program is represented using an instruction set similar to the one discussed in the previous section, but it still lacks any real expressions. The code analysis process includes data-flow analysis, which is where these expressions are formed; type analysis, which is where complex and primitive data types are detected; and control flow analysis, which is where high-level control flow constructs are recovered from the unstructured control flow graph created by the front end. These stages are discussed in detail in the following sections.
Data-Flow Analysis
Data-flow analysis is a critical stage in the decompilation process. This is where the decompiler analyzes the individual, seemingly unrelated machine instructions and makes the necessary connections between them. The connections are created by tracking the flow of data within those instructions and analyzing the impact each individual instruction has on registers and memory
locations. The resulting information from this type of analysis can be used for a number of different things in the decompilation process. It is required for eliminating the concept of registers and operations performed on individual registers, and also for introducing the concept of variables and long expressions that are made up of several machine-level instructions. Data-flow analysis is also where condition codes are eliminated. Condition codes are easily decompiled when dealing with simple comparisons, but they can also be used in other, less obvious ways.
Let’s look at a trivial example where the decompiler must use data-flow analysis to truly “understand” what the code is doing. Think of function return values. It is customary for IA-32 code to use the EAX register for passing return values from a procedure to its caller, but a decompiler cannot necessarily count on that. Different compilers might use different conventions, especially when functions are defined as static and the compiler controls all points of entry into the specific function. In such a case, the compiler might decide to use some other register for passing the return value. How does a decompiler know which register is used for passing back return values and which registers are used for passing parameters into a procedure? This is exactly the type of problem addressed by data-flow analysis.
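One way to picture the kind of summary data-flow analysis produces for this question is the following sketch. The struct and its fields are entirely hypothetical; they just encode the criterion in code: a register looks like a return-value register when a value the procedure defined is still live at the ret and at least one caller reads it right after the call.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-register facts a decompiler might compute. */
typedef struct {
    bool live_at_ret;     /* a value defined here survives to the ret */
    bool read_by_caller;  /* some call site reads it after the call   */
} RegFacts;

static bool is_return_register(const RegFacts *r)
{
    return r->live_at_ret && r->read_by_caller;
}

int main(void)
{
    RegFacts eax = { true, true  };   /* the classic EAX return value */
    RegFacts ecx = { true, false };   /* scratch: no caller uses it   */
    printf("%d %d\n", is_return_register(&eax), is_return_register(&ecx));
    return 0;                         /* prints: 1 0 */
}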
Data-flow analysis is performed by defining a special notation that simplifies this process. This notation must conveniently represent the concept of defining a register (loading it with a new value) and using a register (reading its value). Ideally, such a representation should also simplify the process of identifying points in the code where a register is defined in parallel in two different branches of the control flow graph.
The next section describes SSA, which is a commonly used notation for implementing data-flow analysis (in both compilers and decompilers). After introducing SSA, I proceed to demonstrate areas in the decompilation process where data-flow analysis is required.
Static Single Assignment (SSA)
Static single assignment (SSA) is a special notation commonly used in compilers that simplifies many data-flow analysis problems and can assist in certain optimizations and register allocation. The idea is to treat each individual assignment operation as a different instance of a single variable, so that x becomes x0, x1, x2, and so on with each new assignment operation. SSA can be useful in decompilation because decompilers have to deal with the way compilers reuse registers within a single procedure. It is very common for procedures that use a large number of variables to use a single register for two or more different variables, often of different data types.
One prominent feature of SSA is its support of ϕ-functions (pronounced “fy functions”). ϕ-functions mark positions in the code where the value of a register depends on which branch in the procedure was taken. They typically occur at the merge point of two or more branches in the code, and they define the possible values that the specific register might take, one for each incoming branch. Here is a little example presented in IA-32 code:
mov   esi1, 0          ; Define esi1
cmp   eax1, esi1
jne   NotEquals
mov   esi2, 7          ; Define esi2
jmp   After
NotEquals:
mov   esi3, 3          ; Define esi3
After:
esi4 = ϕ(esi2, esi3)   ; Define esi4
mov   eax2, esi4       ; Define eax2
In this example, it can be clearly seen how each new assignment into ESI essentially declares a new logical register. The definitions of ESI2 and ESI3 take place in two separate branches on the control flow graph, meaning that only one of these assignments can actually take place while the code is running. This is specified in the definition of ESI4, which is defined using a ϕ-function as either ESI2 or ESI3, depending on which particular branch is actually taken. This notation simplifies the code analysis process because it clearly marks positions in the code where a register receives a different value, depending on which branches in the control flow graph are followed.
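In terms of implementation, a decompiler’s IR might represent this along the following lines. This is only a sketch with hypothetical type names; production IRs are considerably richer.

#include <stdio.h>

/* Each assignment to a register creates a fresh SSAValue; a
   phi node additionally records the values that reach the merge
   point from each incoming branch.                              */
typedef struct SSAValue {
    const char *reg;               /* architectural register, e.g. "esi" */
    int version;                   /* 1 for esi1, 2 for esi2, ...        */
    const struct SSAValue *phi[2]; /* non-NULL only for phi nodes        */
} SSAValue;

/* The merge from the listing above: esi4 = phi(esi2, esi3). */
static const SSAValue esi2 = { "esi", 2, { NULL, NULL } };
static const SSAValue esi3 = { "esi", 3, { NULL, NULL } };
static const SSAValue esi4 = { "esi", 4, { &esi2, &esi3 } };

int main(void)
{
    printf("%s%d merges %s%d and %s%d\n",
           esi4.reg, esi4.version,
           esi4.phi[0]->reg, esi4.phi[0]->version,
           esi4.phi[1]->reg, esi4.phi[1]->version);
    return 0;   /* prints: esi4 merges esi2 and esi3 */
}

With every definition given a unique version, a question like “which definition of ESI does this instruction use?” has exactly one answer, which is what makes the propagation described next tractable.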
Data Propagation
Most processor architectures are based on register transfer languages (RTL), which means that they must load values into registers in order to use them. This means that the average program includes quite a few register load and store operations, where the registers are merely used as temporary storage to give certain instructions access to data. Part of the data-flow analysis process in a decompiler involves the elimination of such instructions to improve the readability of the code.
Let’s take the following code sequence as an example:
mov   eax, DWORD PTR _z$[esp+36]
lea   ecx, DWORD PTR [eax+4]
mov   eax, DWORD PTR _y$[esp+32]
cdq
idiv  ecx
mov   edx, DWORD PTR _x$[esp+28]
lea   eax, DWORD PTR [eax+edx*2]
In this code sequence, each value is first loaded into a register before it is used, but the registers serve purely as temporary storage: the contents of ECX and EDX are discarded after this code sequence (EAX is used for passing the result to the caller).
If you directly decompile the preceding sequence into a sequence of assignment expressions, you come up with the following output:
Variable1 = Param3;
Variable2 = Variable1 + 4;
Variable1 = Param2;
Variable1 = Variable1 / Variable2;
Variable3 = Param1;
Variable1 = Variable1 + Variable3 * 2;
Even though this is perfectly legal C code, it is quite different from anything that a real programmer would ever write. In this sample, a local variable was assigned to each register being used, which is totally unnecessary considering that the only reason that the compiler used registers is that many instructions simply can’t work directly with memory operands. Thus it makes sense to track the flow of data in this sequence and eliminate all temporary register usage. For example, you would replace the first two lines of the preceding sequence with:
Variable2 = Param3 + 4;
So, instead of first loading the value of Param3 into a local variable before using it, you just use it directly. If you look at the following two lines, the same principle can be applied just as easily. There is really no need for storing either Param2 or the result of Param3 + 4; you can just compute both inside the division expression, like this:
Variable1 = Param2 / (Param3 + 4);
The same goes for the last two lines: You simply carry over the expression from above and propagate it. This gives you the following complex expression:
Variable1 = Param2 / (Param3 + 4) + Param1 * 2;
The preceding code is obviously far more human-readable. The elimination of temporary storage registers is a critical step in the decompilation process. Of course, this process should not be overdone. In many cases, registers
represent actual local variables that were defined in the original program. Eliminating them might reduce program readability.
In terms of implementation, one representation that greatly simplifies this process is the SSA notation described earlier. That’s because SSA provides a clear picture of the lifespan of each register value and simplifies the process of identifying ambiguous cases where different control flow paths lead to different assignment instructions on the same register. This enables the decompiler to determine when propagation should take place and when it shouldn’t.
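To illustrate the mechanics, here is a toy version of the propagation pass in C. Everything here is hypothetical and heavily simplified: statements are kept as plain strings, and the input is assumed to have already been renamed SSA-style (v1 through v6 stand for the unique definitions behind Variable1 through Variable3 above), so each definition can be folded into its single use by textual substitution.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

typedef struct { const char *dst; char expr[256]; int folded; } Stmt;

/* Replace whole-word occurrences of `name` in `expr` with `(repl)`. */
static int subst(char *expr, const char *name, const char *repl)
{
    char out[256] = "", *o = out;
    size_t nlen = strlen(name);
    int done = 0;
    for (const char *p = expr; *p; ) {
        int boundary = (p == expr) || !isalnum((unsigned char)p[-1]);
        if (boundary && strncmp(p, name, nlen) == 0 &&
            !isalnum((unsigned char)p[nlen])) {
            o += sprintf(o, "(%s)", repl);
            p += nlen;
            done = 1;
        } else {
            *o++ = *p++;
        }
    }
    *o = '\0';
    strcpy(expr, out);
    return done;
}

int main(void)
{
    Stmt s[] = {              /* the register-for-register output above */
        { "v1", "Param3" },
        { "v2", "v1 + 4" },
        { "v3", "Param2" },
        { "v4", "v3 / v2" },
        { "v5", "Param1" },
        { "v6", "v4 + v5 * 2" },
    };
    int n = sizeof s / sizeof s[0];

    /* Fold each earlier definition into the statement that uses it. */
    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (!s[j].folded && subst(s[i].expr, s[j].dst, s[j].expr))
                s[j].folded = 1;

    for (int i = 0; i < n; i++)
        if (!s[i].folded)
            printf("%s = %s;\n", s[i].dst, s[i].expr);
    /* prints: v6 = ((Param2) / ((Param3) + 4)) + (Param1) * 2; */
    return 0;
}

The extra parentheses are inserted blindly at every substitution; a real implementation works on expression trees rather than strings, which also lets it verify that a definition has exactly one use and no intervening redefinition before folding it away.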
Register Variable Identification
After you eliminate all temporary registers during the register copy propagation process, you’re left with registers that are actually used as variables. These are easy to identify because they are used during longer code sequences compared to temporary storage registers, which are often loaded from some memory address, immediately used in an instruction, and discarded. A register variable is typically defined at some point in a procedure and is then used (either read or updated) more than once in the code.
Still, the simple fact is that in some cases it is impossible to determine whether a register originated in a variable in the program source code or whether it was just allocated by the compiler for intermediate storage. Here is a trivial example of how that happens:
int MyVariable = x * 4;
SomeFunc1(MyVariable);
SomeFunc2(MyVariable);
SomeFunc3(MyVariable);
MyVariable++;
SomeFunc4(MyVariable);
In this example the compiler is likely to assign a register for MyVariable, calculate x * 4 into it, and push it as the parameter in the first three function calls. At that point, the register would be incremented and pushed as a parameter for the last function call. The problem is that this is exactly the same code most optimizers would produce for the example that follows as well:
SomeFunc1(x * 4);
SomeFunc2(x * 4);
SomeFunc3(x * 4);
SomeFunc4(x * 4 + 1);
In this case, the compiler is smart enough to realize that x * 4 doesn’t need to be calculated four times. Instead it just computes x * 4 into a register and pushes that value into each function call. Before the last call to SomeFunc4 that register is incremented and is then passed into SomeFunc4, just as in the previous example where the variable was explicitly defined. This is good
