Eilam E.Reversing.Secrets of reverse engineering.2005
.pdf
PART IV
Beyond Disassembly
CHAPTER 12
Reversing .NET
This book has so far focused on just one reverse-engineering platform: native code written for IA-32 and compatible processors. Even though there are many programs that fall under this category, it still makes sense to discuss other, emerging development platforms that might become more popular in the future. There are endless numbers of such platforms. I could discuss other operating systems that run under IA-32 such as Linux, or discuss other platforms that use entirely different operating systems and different processor architectures, such as Apple Macintosh. Beyond operating systems and processor architectures, there are also high-level platforms that use a special assembly language of their own, and can run under any platform. These are virtual-machine-based platforms such as Java and .NET.
Even though Java has grown to be an extremely powerful and popular programming language, this chapter focuses exclusively on Microsoft’s .NET platform. There are several reasons why I chose .NET over Java. First of all, Java has been around longer than .NET, and the subject of Java reverse engineering has been covered quite extensively in various articles and online resources. Additionally, I think it would be fair to say that Microsoft technologies have a general tendency of attracting large numbers of hackers and reversers. The reason why that is so is the subject of some debate, and I won’t get into it here.
In this chapter, I will be covering the basic techniques for reverse engineering .NET programs. This requires that you become familiar with some of the ground rules of the .NET platform, as well as with the native language of the .NET platform: MSIL. I’ll go over some simple MSIL code samples and analyze them just as I did with IA-32 code in earlier chapters. Finally, I’ll introduce some tools that are specific to .NET (and to other bytecode-based platforms) such as obfuscators and decompilers.
Ground Rules
Let’s get one thing straight: reverse engineering of .NET applications is an entirely different ballgame compared to what I’ve discussed so far. Fundamentally, reversing a .NET program is an incredibly trivial task. .NET programs are compiled into an intermediate language (or bytecode) called MSIL (Microsoft Intermediate Language). MSIL is highly detailed; it contains far more high-level information regarding the original program than an IA-32 compiled program does. These details include the full definition of every data structure used in the program, along with the names of almost every symbol used in the program. That’s right: The names of every object, data member, and member function are included in every .NET binary—that’s how the .NET runtime (the CLR) can find these objects at runtime!
This not only greatly simplifies the process of reversing a program by reading its MSIL code, but it also opens the door to an entirely different level of reverse-engineering approaches. There are .NET decompilers that can accurately recover a source-code-level representation of most .NET programs. The resulting code is highly readable, both because the original symbol names are preserved throughout the program and because of the highly detailed information that resides in the binary. Decompilers can use this information to reconstruct both the flow and logic of the program and detailed information regarding its objects and data types. Figure 12.1 demonstrates a simple C# function and what it looks like after decompilation with the Salamander decompiler. Notice how pretty much every important detail regarding the source code is preserved in the decompiled version (local variable names are gone, but Salamander cleverly names them i and j).
Because of the high level of transparency offered by .NET programs, the concept of obfuscation of .NET binaries is very common and is far more popular than it is with native IA-32 binaries. In fact, Microsoft even ships an obfuscator with its .NET development platform, Visual Studio .NET. As Figure 12.1 demonstrates, if you ship your .NET product without any form of obfuscation, you might as well ship your source code along with your executable binaries.
Original Function Source Code

    public static void Main() {
        int x, y;
        for (x = 1; x <= 10; x++) {
            for (y = 1; y <= 10; y++) {
                Console.Write("{0 } ", x*y);
            }
            Console.WriteLine("");
        }
    }

Compilation -> IL Executable Binary -> Decompilation

Salamander Decompiler Output

    public static void Main()
    {
        for (int i = 1; i <= 10; i++)
        {
            for (int j = 1; j <= 10; j++)
            {
                Console.Write("{0 } ", (i * j));
            }
            Console.WriteLine("");
        }
    }

Figure 12.1 The original source code and the decompiled version of a simple C# function.
.NET Basics
Unlike native machine code programs, .NET programs require a special environment in which they can be executed. This environment, which is called the .NET Framework, acts as a sort of intermediary between .NET programs and the rest of the world. The .NET Framework is basically the software execution environment in which all .NET programs run, and it consists of two primary components: the common language runtime (CLR) and the .NET class library. The CLR is the environment that loads and verifies .NET assemblies and is essentially a virtual machine inside which .NET programs are safely executed. The class library is what .NET programs use in order to communicate with the outside world. It is a class hierarchy that offers all kinds of services such as user-interface services, networking, file I/O, string management, and so on. Figure 12.2 illustrates the connection between the various components that together make up the .NET platform.
A .NET binary module is referred to as an assembly. Assemblies contain a combination of IL code and associated metadata. Metadata is a special data block that stores data type information describing the various objects used in the assembly, as well as the accurate definition of any object in the program (including local variables, method parameters, and so on). Assemblies are executed by the common language runtime, which loads the metadata into memory and compiles the IL code into native code using a just-in-time compiler.
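The metadata described above can be observed directly through the reflection API in the class library. The following is a minimal sketch, assuming a made-up Account class and a hypothetical MemberNames helper (neither appears in the original text); it simply lists the field and method names that the compiler recorded for a type, including a private field.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical sample type; its member names end up in the assembly's metadata.
public class Account
{
    private decimal balance;
    public void Deposit(decimal amount) { balance += amount; }
    public decimal GetBalance() { return balance; }
}

public static class MetadataDemo
{
    // Returns the names of the fields and methods declared directly on a type,
    // public or not, as recorded in the assembly metadata.
    public static string[] MemberNames(Type type)
    {
        const BindingFlags flags = BindingFlags.Public | BindingFlags.NonPublic |
                                   BindingFlags.Instance | BindingFlags.DeclaredOnly;
        return type.GetMembers(flags)
                   .Where(m => m.MemberType == MemberTypes.Field ||
                               m.MemberType == MemberTypes.Method)
                   .Select(m => m.Name)
                   .OrderBy(n => n, StringComparer.Ordinal)
                   .ToArray();
    }

    public static void Main()
    {
        foreach (string name in MemberNames(typeof(Account)))
            Console.WriteLine(name);
    }
}
```

A reverser does not need the program's cooperation to read the same information; tools that parse the assembly file can recover these names without running any of its code.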
Managed Code
Managed code is any code that is verified by the CLR at runtime for security, type safety, and memory usage. Managed code consists of the two basic .NET elements: MSIL code and metadata. This combination of MSIL code and metadata is what allows the CLR to actually execute managed code. At any given moment, the CLR is aware of the data types that the program is dealing with. By contrast, in conventional compiled languages such as C and C++, data structures are accessed by loading a pointer and calculating the specific offset that needs to be accessed. The processor has no idea what this data structure represents and whether the actual address being accessed is valid or not.
While running managed code the CLR is fully aware of almost every data type in the program. The metadata contains information about class definitions, methods and the parameters they receive, and the types of every local variable in each method. This information allows the CLR to validate operations performed by the IL code and verify that they are legal. For example, when an assembly that contains managed code accesses an array item, the CLR can easily check the size of the array and simply raise an exception if the index is out of bounds.
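The array example can be sketched in a few lines of C#. This is only an illustration, assuming a hypothetical ReadItem helper; the exception type, IndexOutOfRangeException, is the one the runtime raises for an out-of-range array index.

```csharp
using System;

public static class BoundsCheckDemo
{
    // Reads one array element; the runtime checks the index against the
    // array's length and throws instead of touching memory outside the array.
    public static string ReadItem(int[] items, int index)
    {
        try
        {
            return items[index].ToString();
        }
        catch (IndexOutOfRangeException)
        {
            return "out of bounds";
        }
    }

    public static void Main()
    {
        int[] items = { 10, 20, 30 };
        Console.WriteLine(ReadItem(items, 1)); // prints 20
        Console.WriteLine(ReadItem(items, 3)); // prints out of bounds
    }
}
```

The equivalent access in native C code would simply read whatever happens to lie past the end of the array.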
[Figure 12.2 diagram: source code in Visual Basic .NET, C#, Managed C++, and J# is compiled by the Visual Basic .NET compiler (vbc.exe), the C# compiler (csc.exe), the Managed C++ compiler (cl.exe /CLR), and the J# compiler (vjc.exe), respectively, into an executable that contains metadata and intermediate language (IL). The executable runs inside the .NET Framework, which consists of the common language runtime (CLR), with its managed code verifier, garbage collector, and just-in-time (JIT) compiler, and the .NET class library, all sitting on top of the operating system.]
Figure 12.2 Relationship between the common language runtime, IL, and the various .NET programming languages.
.NET Programming Languages
.NET is not tied to any specific language (other than IL), and compilers have been written to support numerous programming languages. The following are the most popular programming languages used in the .NET environment.
C#: C Sharp is the .NET programming language in the sense that it was designed from the ground up as the “native” .NET language. It has a syntax that is similar to that of C++, but is functionally more similar to Java than to C++. Both C# and Java are object oriented and allow only single inheritance of classes (a class can derive from just one base class). Both languages are type safe, meaning that they do not allow any misuse of data types (such as unsafe typecasting, and so on). Additionally, both languages work with a garbage collector and don’t support explicit deletion of objects (in fact, no .NET language supports explicit deletion of objects; they are all based on garbage collection).
Managed C++: Managed C++ is an extension to Microsoft’s C/C++ compiler (cl.exe), which can produce a managed IL executable from C++ code.
Visual Basic .NET: Microsoft has created a Visual Basic compiler for .NET, which means that they’ve essentially eliminated the old Visual Basic virtual machine (VBVM) component, which was the runtime component in which all Visual Basic programs executed in previous versions of the platform. Visual Basic .NET programs now run using the CLR, which means that Visual Basic executables are now structurally the same as C# and Managed C++ executables: They all consist of managed IL code and metadata.
J#: J Sharp is simply an implementation of Java for .NET. Microsoft provides a Java-compatible compiler for .NET which produces IL executables instead of Java bytecode. The idea is obviously to allow developers to easily port their Java programs to .NET.
One remarkable thing about .NET and all of these programming languages is their ability to easily interoperate. Because of the presence of metadata that accurately describes an executable, programs can interoperate at the object level regardless of the programming language they are created in. It is possible for one program to seamlessly inherit a class from another program even if one was written in C# and the other in Visual Basic .NET, for instance.
Common Type System (CTS)
The Common Type System (CTS) governs the organization of data types in .NET programs. There are two fundamental data types: values and references. Values are data types that represent actual data, while reference types represent a reference to the actual data, much like the conventional notion of pointers. Values are typically allocated on the stack or inside some other object, while with references the actual objects are typically allocated in a heap block, which is freed automatically by the garbage collector (granted, this explanation is somewhat simplistic, but it’ll do for now).
The typical use for value data types is for built-in data types such as integers, but developers can also define their own user-defined value types, which are moved around by value. This is generally only recommended for smaller data types, because the data is duplicated when passed to other methods, and so on. Larger data types use reference types, because with reference types only the reference to the object is duplicated—not the actual data.
Finally, unlike values, reference types are self-describing, which means that a reference contains information on the exact object type being referenced. This is different from value types, which don’t carry any identification information.
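The difference between the two kinds of types is easiest to see when an instance is passed to a method. The sketch below assumes two made-up types, PointValue (a struct, hence a value type) and PointRef (a class, hence a reference type); the names are illustrative only.

```csharp
using System;

public struct PointValue { public int X; } // value type: copied when passed
public class PointRef { public int X; }    // reference type: only the reference is copied

public static class CopySemanticsDemo
{
    static void Bump(PointValue p) { p.X++; } // modifies a private copy
    static void Bump(PointRef p) { p.X++; }   // modifies the caller's object

    public static string Run()
    {
        var v = new PointValue { X = 1 };
        var r = new PointRef { X = 1 };
        Bump(v);
        Bump(r);
        return v.X + " " + r.X;
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints 1 2
    }
}
```

The struct is unchanged after the call because the method received a duplicate of its data, while the class instance is changed because both the caller and the method refer to the same heap object.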
One interesting thing about the CTS is the concept of boxing and unboxing. Boxing is the process of converting a value type data structure into a reference type object. Internally, this is implemented by duplicating the object in question and producing a reference to that duplicated object. The idea is that this boxed object can be used with any method that expects a generic object reference as input. Remember that reference types carry type identification information with them, so by taking an object reference type as input, a method can actually check the object’s type at runtime. This is not possible with a value type. Unboxing is simply the reverse process, which converts the object back to a value type. This is needed in case the object is modified while it is in object form: because boxing duplicates the object, any changes made to the boxed object would not be reflected in the original value type unless it was explicitly unboxed.
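A short C# sketch of boxing and unboxing follows. The BoxingDemo class and its Run method are illustrative names, not part of any library; the example shows that the boxed object is a separate copy that carries its own type information.

```csharp
using System;

public static class BoxingDemo
{
    public static string Run()
    {
        int value = 42;
        object boxed = value;      // boxing: the integer is copied into a heap object
        value = 100;               // changes the original only, not the boxed copy
        int unboxed = (int)boxed;  // unboxing: copy the data back into a value type
        // The boxed object is self-describing, so its type can be queried at runtime.
        return value + " " + unboxed + " " + boxed.GetType().Name;
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints 100 42 Int32
    }
}
```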
Intermediate Language (IL)
As described earlier, .NET executables are rarely shipped as native executables [1]. Instead, .NET executables are distributed in an intermediate form called Common Intermediate Language (CIL) or Microsoft Intermediate Language (MSIL), but we’ll just call it IL for short. .NET programs essentially have two compilation stages: First a program is compiled from its original source code to IL code, and during execution the IL code is recompiled into native code by the just-in-time compiler. The following sections describe some basic low-level .NET concepts such as the evaluation stack and the activation record, and introduce the IL and its most important instructions. Finally, I will present a few IL code samples and analyze them.
[1] It is possible to ship a precompiled .NET binary that doesn’t contain any IL code, and the primary reason for doing so is security: it is much harder to reverse or decompile such an executable. For more information, please see the section later in this chapter on the Remotesoft Protector product.
The Evaluation Stack
The evaluation stack is used for managing state information in .NET programs. It is used by IL code in a way that is similar to how IA-32 instructions use registers: for storing intermediate information such as the input and output data for instructions. Probably the most important thing to realize about the evaluation stack is that it doesn’t really exist! Because IL code is never interpreted at runtime and is always compiled into native code before being executed, the evaluation stack only exists during the JIT process. It has no meaning during runtime.
Unlike the IA-32 stacks you’ve gotten so used to, the evaluation stack isn’t made up of 32-bit entries, or any other fixed-size entries. A single entry in the stack can contain any data type, including whole data structures. Many instructions in the IL instruction set are polymorphic, meaning that they can take different data types and properly deal with a variety of types. This means that arithmetic instructions, for instance, can operate correctly on either floating-point or integer operands. There is no need to explicitly tell instructions which data types to expect; the JIT will perform the necessary data-flow analysis and determine the data types of the operands passed to each instruction.
To properly grasp the philosophy of IL, you must get used to the idea that the CLR is a stack machine, meaning that IL instructions use the evaluation stack just like IA-32 assembly language instructions use registers. Practically every instruction either pops a value off of the stack or pushes some kind of value back onto it, and that is how IL instructions access their operands.
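To make the stack-machine idea concrete, here is a toy model written in C#. It is purely a conceptual sketch: the ToyStackMachine class is hypothetical, it handles only integer operands and three instructions (ldc.i4, add, and mul, which are real IL mnemonics), and it interprets the program, which is precisely what the CLR does not do. It is meant to show the push and pop semantics, not how IL is actually executed.

```csharp
using System;
using System.Collections.Generic;

public static class ToyStackMachine
{
    // Evaluates a tiny IL-like program and returns the value left on the stack.
    public static int Run(string[] program)
    {
        var stack = new Stack<int>();
        foreach (string line in program)
        {
            string[] parts = line.Split(' ');
            switch (parts[0])
            {
                case "ldc.i4": // push an integer constant
                    stack.Push(int.Parse(parts[1]));
                    break;
                case "add":    // pop two operands, push their sum
                    stack.Push(stack.Pop() + stack.Pop());
                    break;
                case "mul":    // pop two operands, push their product
                    stack.Push(stack.Pop() * stack.Pop());
                    break;
                default:
                    throw new ArgumentException("unsupported instruction: " + parts[0]);
            }
        }
        return stack.Pop();
    }

    public static void Main()
    {
        // Computes (2 + 3) * 4 using only pushes and pops.
        string[] program = { "ldc.i4 2", "ldc.i4 3", "add", "ldc.i4 4", "mul" };
        Console.WriteLine(Run(program)); // prints 20
    }
}
```

Notice that neither add nor mul names its operands; each simply consumes whatever the preceding instructions left on the stack.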
Activation Records
Activation records are data elements that represent the state of the currently running function, much like a stack frame in native programs. An activation record contains the parameters passed to the current function along with all the local variables in that function. For each function call a new activation record is allocated and initialized. In most cases, the CLR allocates activation records on the stack, which means that they are essentially the same thing as the stack frames you’ve worked with in native assembly language code. The IL instruction set includes special instructions that access the current activation record for both function parameters and local variables (see below). Activation records are automatically allocated by the IL instruction call.
IL Instructions
Let’s go over the most common and interesting IL instructions, just to get an idea of the language and what it looks like. Table 12.1 provides descriptions for some of the most popular instructions in the IL instruction set. Note that the instruction set contains over 200 instructions and that this is nowhere near a
