VMProtect, Part 3: Optimization and Code Generation

Originally published on August 6th, 2008 on OpenRCE

This is part #3 of a four-part series on VMProtect.

The reader should inspect this unoptimized IR listing before continuing.  In an attempt to keep this entry from becoming unnecessarily long, the example snippets will be small, but for completeness a more thorough running example is linked throughout the text.

We begin by removing the stack machine features of the IR.  Since VMProtect operates on disassembled x86 code, and x86 itself is not a stack machine, this aspect of the protection is unnatural and easily removed.  Here is a 16-line fragment of VMProtect IR.

push Dword(-88)
push esp
push Dword(4)
pop t3
pop t4
t5 = t3 + t4
push t5
push flags t5
pop DWORD Scratch:[Dword(52)]
pop t6
pop t7
t8 = t6 + t7
push t8
push flags t8
pop DWORD Scratch:[Dword(12)]
pop esp

All but two instructions are pushes or pops, and the pushes can be easily matched up with the pops.  Tracking the stack pointer, we see that, for example, t3 = Dword(4).  A simple analysis allows us to "optimize away" the push/pop pairs into assignment statements.  Simply iterate through each instruction in a basic block and keep a stack describing the source of each push.  For every pop, ensure that the sizes match and record the location of the corresponding push.  We wish to replace the pop with an assignment to the popped expression from the pushed expression, as in

t3 = Dword(4)
t4 = esp 
t7 = Dword(-88)
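This pairing analysis is simple enough to sketch in a few lines.  The following Python fragment (an illustration of the idea, not the OCaml implementation used here) walks a block, records the source of each push, and turns each pop into an assignment:

```python
def remove_stack_ops(block):
    """Rewrite matched push/pop pairs in a basic block as assignments.

    Each instruction is a tuple: ("push", src) or ("pop", dst).
    A production pass would also verify that operand sizes match;
    this is a sketch of the bookkeeping only.
    """
    stack, out = [], []
    for insn in block:
        if insn[0] == "push":
            stack.append(insn[1])            # remember the pushed source
        elif insn[0] == "pop":
            src = stack.pop()                # the matching push
            out.append(("assign", insn[1], src))
        else:
            out.append(insn)
    return out

# The first five lines of the fragment above:
block = [("push", "Dword(-88)"), ("push", "esp"), ("push", "Dword(4)"),
         ("pop", "t3"), ("pop", "t4")]
print(remove_stack_ops(block))   # t3 = Dword(4); t4 = esp
```

Anything left on the analysis stack at the end of the block (here, the Dword(-88)) simply remains a push in the output.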

With the stack aspects removed, we are left with a more conventional listing containing many assignment statements.  This optimization substantially reduces the number of instructions in a given basic block (~40% for the linked example) and opens the door for other optimizations.  The newly optimized code is nine lines, roughly half of the original:

t3 = Dword(4)
t4 = esp
t5 = t3 + t4
DWORD Scratch:[Dword(52)] = flags t5
t6 = t5
t7 = Dword(-88)
t8 = t6 + t7
DWORD Scratch:[Dword(12)] = flags t8
esp = t8

A complete listing of the unoptimized IR versus the one with the stack machine features removed is here, which should be perused before proceeding.

Now we turn our attention to the temporary variables and the scratch area.  Recall that the former were not part of the pre-protected x86 code, nor the VMProtect bytecode -- they were introduced in order to ease the IR translation.  The latter is part of the VMProtect bytecode, but was not part of the original pre-protected x86 code.  Since these are not part of the languages we are modelling, we shall eliminate them wholesale.  On a high level, we treat each temporary variable, each byte of the scratch space, and each register as being a variable defined within a basic block, and then eliminate the former two via the compiler optimizations previously discussed.

Looking again at the last snippet of IR, we can see several areas for improvement.  First, consider the variable t6.  It is clearly just a copy of t5, neither of which are redefined before the next use in the assignment to t8.  Copy propagation will replace variable t6 with t5 and eliminate the former.  More generally, t3, t4, and t7 contain either constants or values that are not modified between their uses.  Constant and copy propagation will substitute the assignments to these variables in for their uses and eliminate them.

The newly optimized code is a slender three lines compared to the original 16; we have removed roughly 80% of the IR for the running example.

DWORD Scratch:[Dword(52)] = flags Dword(4) + esp
esp = Dword(4) + esp + Dword(-88)
DWORD Scratch:[Dword(12)] = flags Dword(4) + esp + Dword(-88)

The side-by-side comparison can be found here.

The IR now looks closer to x86, with the exception that the results of computations are being stored in the scratch area, not into registers.  As before, we apply dead-store elimination, copy and constant propagation to the scratch area, removing dependence upon it entirely in the process.  See here for a comparison with the last phase.

Here is a comparison of the final, optimized code against the original x86:

push ebp                          push ebp
ebp = esp                         mov ebp, esp
push Dword(-1)                    push 0FFFFFFFFh
push Dword(4525664)               push 450E60h
push Dword(4362952)               push offset sub_4292C8
eax = DWORD FS:[Dword(0)]         mov eax, large fs:0
push eax                          push eax
DWORD FS:[Dword(0)] = esp         mov large fs:0, esp
eflags = flags esp + Dword(-88)
esp = esp + Dword(-88)            add esp, 0FFFFFFA8h
push ebx                          push ebx
push esi                          push esi
push edi                          push edi
DWORD SS:[Dword(-24) + ebp] = esp mov [ebp-18h], esp
call DWORD [Dword(4590300)]       call dword ptr ds:unk_460ADC
vmreturn Dword(0) + Dword(4638392)

Code generation is an afterthought.

VMProtect, Part 2: Primer on Optimization

Originally published on OpenRCE on August 6th, 2008

This is part #2 of a four-part series on VMProtect.

Basically, VMProtect bytecode and the IR differ from x86 assembly language in four ways:

  1. They are stack machines;
  2. The IR contains "temporary variables";
  3. They contain what I've called a "scratch" area, upon which computations are performed rather than in the registers;
  4. In the case of VMProtect Ultra, they are obfuscated (or rather, "de-optimized") in certain ways.

It turns out that removing these four aspects from the IR is sufficient preparation for compilation into sensible x86 code.  We accomplish this via standard compiler optimizations applied locally to each basic block.  In general, there are a few main compiler optimizations used in this process.  The first one is "constant propagation".  Consider the following C code.

int x = 1; 
function(x);

Clearly x will always be 1 when function is invoked:  it is defined on the first line, and is not re-defined before it is used in the following line (the definition in line one "reaches" the use in line two; alternatively, the path between the two is "definition-clear" with respect to x).  Thus, the code can be safely transformed into "function(1)".  If the first line is the only definition of the variable x, then we can replace all uses of x with the integer 1.  If the second line is the only use of the variable x, then we can eliminate the variable.
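On a suitable three-address IR, local constant propagation is a single forward pass.  A minimal Python sketch (the IR encoding here is hypothetical, invented for illustration):

```python
def constant_propagate(block):
    """Local constant propagation over a basic block.

    A hypothetical three-address encoding for illustration:
    ("assign", dst, src) and ("call", arg), where operands are
    variable names or integer constants.
    """
    consts, out = {}, []
    for insn in block:
        if insn[0] == "assign":
            dst, src = insn[1], insn[2]
            src = consts.get(src, src)     # substitute a known constant
            if isinstance(src, int):
                consts[dst] = src          # this definition reaches later uses
            else:
                consts.pop(dst, None)      # dst no longer holds a constant
            out.append(("assign", dst, src))
        else:
            out.append((insn[0], consts.get(insn[1], insn[1])))
    return out

# int x = 1; function(x);   ==>   function(1)
print(constant_propagate([("assign", "x", 1), ("call", "x")]))
```

A separate dead-code pass would then notice that the assignment to x has no remaining uses and delete it.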

The next is "constant folding".  Consider the following C code.

int x = 1024; 
function(x*1024);

By the above remarks, we know we can transform the second line into "function(1024*1024);".  It would be silly to generate code that actually performed this multiplication at run-time:  the value is known at compile-time, and should be computed then.  We can replace the second line with "function(1048576);", and in general we can replace any binary operation performed upon constant values with the computed result of that operation.
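Folding itself is a one-liner per operator: if both operands are constants, evaluate at compile time; otherwise leave the expression alone.  A sketch:

```python
import operator

# Illustrative operator table; extend as needed.
OPS = {"+": operator.add, "*": operator.mul, "&": operator.and_,
       "<<": operator.lshift, ">>": operator.rshift}

def fold(op, lhs, rhs):
    """Evaluate a binary operation at compile time when both
    operands are constants; otherwise return it unchanged."""
    if isinstance(lhs, int) and isinstance(rhs, int):
        return OPS[op](lhs, rhs)
    return (op, lhs, rhs)

print(fold("*", 1024, 1024))   # 1048576: no run-time multiply needed
print(fold("*", "x", 1024))    # stays symbolic: ('*', 'x', 1024)
```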

Similar to constant propagation is "copy propagation", as in the below.

void f(int i)
{
  int j = i;
  g(j);
}

The variable j is merely a copy of the variable i, and so the variable i can be substituted in for j until the point where either is redefined.  Lacking redefinitions, j can be eliminated entirely.

The next optimization is "dead-store elimination".  Consider the following C code.

int y = 2; 
y = 1;

The definition of the variable y on line one is immediately un-done by the one on line two.  Therefore, there is no reason to actually assign the value 2 to the variable; the first store to y is a "dead store", and can be eliminated by the standard liveness-based compiler optimization known as "dead-store elimination", or more generally "dead-code elimination".  

Here's an example from the VMProtect IR.

ecx = DWORD Scratch:[Dword(44)]
ecx = DWORD Scratch:[Dword(20)]

After dead-store-eliminating the first instruction, it turns out no other instructions use Scratch:[Dword(44)], and so its previous definition can be eliminated as well.
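Local dead-store elimination is a backward pass.  A Python sketch (again with a hypothetical IR encoding; it assumes the stored-to locations are not live out of the block, which is exactly the situation for the temporaries and scratch slots):

```python
def eliminate_dead_stores(block):
    """Remove assignments whose destination is overwritten before
    any use.  Walk the block backwards, tracking locations that are
    guaranteed to be redefined before they are read again."""
    dead, out = set(), []
    for dst, srcs in reversed(block):
        if dst in dead:
            continue                  # redefined later, never read: dead store
        out.append((dst, srcs))
        dead.add(dst)                 # earlier defs of dst are now dead...
        for s in srcs:
            dead.discard(s)           # ...unless read by this instruction
    out.reverse()
    return out

# The IR fragment above: the first definition of ecx is a dead store.
block = [("ecx", ["Scratch[44]"]),
         ("ecx", ["Scratch[20]"])]
print(eliminate_dead_stores(block))
```

After this pass, Scratch:[Dword(44)] has no remaining uses, so a subsequent iteration deletes its definition too — which is why these optimizations are run to a fixed point.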

VMProtect, Part 1: Bytecode and IR

Originally posted on August 6th, 2008 on OpenRCE

This is part #1 of a four-part series on VMProtect.

The approach I took with ReWolf's x86 Virtualizer is also applicable here, although a more sophisticated compiler is required.  What follows are some preliminary notes on the design and implementation of such a component.  These are not complete details on breaking the protection; I confess to having only looked at a few samples, and I am not sure which protection options were enabled.

As before, we begin by constructing a disassembler for the interpreter.  This is immediately problematic, since the bytecode language is polymorphic.  I have created an IDA plugin that automatically constructs OCaml source code for a bytecode disassembler.  In a production-quality implementation, this should be implemented as a standalone component that returns a closure.

The generated disassembler, then, looks like this:

let disassemble bytearray index =
match (bytearray.(index) land 0xff) with
| 0x0 -> (VM__Handler0__PopIntoRegister(0), [index+1])
| 0x1 -> (VM__Handler1__PushDwordFromRegister(0), [index+1])
| 0x2 -> (VM__Handler2__AddWords, [index+1])
| 0x3 -> (VM__Handler3__StoreByteIntoRegister(bytearray.(index+1)), [index+2])
| 0x4 -> (VM__Handler0__PopIntoRegister(4), [index+1])
| 0x5 -> (VM__Handler1__PushDwordFromRegister(4), [index+1])
| 0x6 -> (VM__Handler4__ShrDword, [index+1])
| 0x7 -> (VM__Handler5__ReadDword__FromStackSegment, [index+1])
| ... -> ...

Were we to work with the instructions individually in their natural granularity, depicted above, the bookkeeping on the semantics of each would likely prove tedious.  For illustration, compare and contrast handlers #02 and #04.  Both have the same basic pattern:  pop two values (words vs. dwords), perform a binary operation (add vs. shr), push the result, then push the flags.  The current representation of instructions does not express these, or any, similarities.

Handler #02:           Handler #04:
mov ax, [ebp+0]        mov eax, [ebp+0]
sub ebp, 2             mov cl, [ebp+4]
add [ebp+4], ax        sub ebp, 2
pushf                  shr eax, cl
pop dword ptr [ebp+0]  mov [ebp+4], eax
                       pushf
                       pop dword ptr [ebp+0]

Therefore, we pull a standard compiler-writer's trick and translate the VMProtect instructions into a simpler, "intermediate" language (hereinafter "IR") which resembles the pseudocode snippets atop the handlers in part zero.  Below is a fragment of that language's abstract syntax.

type size = B | W | D | Q
type temp = int * size
type seg = Scratch | SS | FS | Regular
type irbinop = Add | And | Shl | Shr | MakeQword
type irunop = Neg | MakeByte | TakeHighDword | Flags
type irexpr = 
| Reg of register 
| Temp of int 
| Const of const 
| Deref of seg * irexpr * size 
| Binop of irexpr * irbinop * irexpr 
| Unop of irexpr * irunop

type ir = 
DeclareTemps of temp list
| Assign of irexpr * irexpr
| Push of irexpr
| Pop of irexpr
| Return

A portion of the VMProtect -> IR translator follows; compare the translation for handlers #02 and #04.

let make_microcode = function
VM__Handler0__PopIntoRegister(b) -> [Pop(Deref(Scratch, Const(Dword(zero_extend_byte_dword(b land 0x3C))), D))]
| VM__Handler2__AddWords -> [DeclareTemps([(0, W);(1, W);(2, W)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Temp(2), Binop(Temp(0), Add, Temp(1)));
 Push(Temp(2));
 Push(Unop(Temp(2), Flags))]
| VM__Handler4__ShrDword -> [DeclareTemps([(0, D);(1, W);(2, D)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Temp(2), Binop(Temp(0), Shr, Temp(1)));
 Push(Temp(2));
 Push(Unop(Temp(2), Flags))]
| VM__Handler7__PushESP-> [Push(Reg(Esp))]
| VM__Handler23__WriteDwordIntoFSSegment -> [DeclareTemps([(0, D);(1, D)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Deref(FS, Temp(0), D), Temp(1))]
| (*...*) -> (*...*)

To summarize the process, below is a listing of VMProtect instructions, each followed by its IR translation (as a comment) and the assembly code that is executed for it.

VM__Handler1__PushDwordFromRegister 32

; Push (Deref (Scratch, Const (Dword 32l), D));

and al, 3Ch ; al = 32
mov edx, [edi+eax]
sub ebp, 4
mov [ebp+0], edx

VM__Handler7__PushESP
; Push (Reg Esp);
mov eax, ebp
sub ebp, 4
mov [ebp+0], eax

VM__Handler0__PopIntoRegister 40
; Pop (Deref (Scratch, Const (Dword 40l), D));
and al, 3Ch
mov edx, [ebp+0]
add ebp, 4
mov [edi+eax], edx

VM__Handler19__PushSignedByteAsDword (-1l)
; Push (Const (Dword (-1l))); 
movzx eax, byte ptr [esi] ; *esi = -1
sub esi, 0FFFFFFFFh
cbw
cwde
sub ebp, 4
mov [ebp+0], eax

VM__Handler9__PushDword 4525664l
; Push (Const (Dword 4525664l));
mov eax, [esi] ; *esi = 4525664l
add esi, 4
sub ebp, 4
mov [ebp+0], eax

VM__Handler9__PushDword 4362952l
; Push (Const (Dword 4362952l));
mov eax, [esi] ; *esi = 4362952l
add esi, 4
sub ebp, 4
mov [ebp+0], eax

VM__Handler19__PushSignedByteAsDword 0l
; Push (Const (Dword (0l))); 
movzx eax, byte ptr [esi] ; *esi = 0
sub esi, 0FFFFFFFFh
cbw
cwde
sub ebp, 4
mov [ebp+0], eax

VM__Handler42__ReadDwordFromFSSegment
; DeclareTemps [(0, D)];
; Pop (Temp 0);
; Push (Deref (FS, Temp 0, D));
mov eax, [ebp+0]
mov eax, fs:[eax]
mov [ebp+0], eax

VMProtect, Part 0: Basics

Originally published on August 6th, 2008 on OpenRCE

This is part #0 of a four-part series on VMProtect.

VMProtect is a virtualization protector.  Like other protections in the genre, such as ReWolf's x86 Virtualizer and CodeVirtualizer, it works by disassembling the x86 bytecode of the target executable and compiling it into a proprietary, polymorphic bytecode which is executed in a custom interpreter at run-time.  This is unlike the traditional notions of packing, in which the x86 bytecode is simply encrypted and/or compressed:  with virtualization, the original x86 bytecode in the protected areas is gone, never to be seen again.  Or so the idea goes.

If you've never looked at VMProtect before, I encourage you to take a five-minute look in IDA (here's a sample packed binary).  As far as VMs go, it is particularly skeletal and easily comprehended.  The difficulty lies in recreating working x86 bytecode from the VM bytecode.  Here's a two-minute analysis of its dispatcher.

push edi; push all registers
push ecx
push edx
push esi
push ebp
push ebx
push eax
push edx
pushf
push 0 ; imagebase fixup
mov esi, [esp+8+arg_0] ; esi = pointer to VM bytecode
mov ebp, esp ; ebp = VM's "stack" pointer
sub esp, 0C0h
mov edi, esp ; edi = "scratch" data area

VM__FOLLOW__Update:
add esi, [ebp+0]

VM__FOLLOW__Regular:
mov al, [esi]; read a byte from EIP
movzx eax, al
sub esi, -1; increment EIP
jmp ds:VM__HandlerTable[eax*4] ; execute instruction handler

A feature worth discussing is the "scratch space", referenced by the register edi throughout the dispatch loop.  This is a 16-dword-sized area on the stack where VMProtect saves the registers upon entering the VM, modifies them throughout the course of a basic block, and from whence it restores the registers upon exit.  For each basic block protected by the VM, the layout of the registers in the scratch space can potentially be different.
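The fetch/dispatch/execute cycle above is the whole interpreter.  A minimal Python model of such a stack machine with a scratch area (the handler numbering and encodings here are invented for illustration — they are not VMProtect's):

```python
def run_vm(bytecode, handlers, scratch_size=16):
    """Minimal model of the dispatcher above: fetch a byte,
    index a handler table, repeat.  esi ~ ip, ebp ~ stack,
    edi ~ scratch area."""
    ip = 0
    stack = []                       # the VM "stack" (ebp in the real thing)
    scratch = [0] * scratch_size     # register save area (edi)
    while ip < len(bytecode):
        opcode = bytecode[ip] & 0xFF # mov al, [esi] / movzx eax, al
        ip += 1                      # sub esi, -1
        ip = handlers[opcode](stack, scratch, bytecode, ip)
    return stack, scratch

# Hypothetical two-entry handler table: 0 = push immediate dword, 1 = add
def push_imm(stack, scratch, code, ip):
    stack.append(int.from_bytes(code[ip:ip+4], "little"))
    return ip + 4

def add_dwords(stack, scratch, code, ip):
    b, a = stack.pop(), stack.pop()
    stack.append((a + b) & 0xFFFFFFFF)
    return ip

stack, _ = run_vm(bytes([0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 1]),
                  {0: push_imm, 1: add_dwords})
print(stack)   # [5]
```

Note that handlers which take immediates advance the instruction pointer themselves, exactly as the real handlers read from esi.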

Here's a disassembly of some instruction handlers.  Notice that A) VMProtect is a stack machine and that B) each handler -- though consisting of scant few instructions -- performs several tasks, e.g. popping several values, performing multiple operations, pushing one or more values.

#00:x = [EIP-1] & 0x3C; y = popd; [edi+x] = y

.text:00427251 and al, 3Ch; al = instruction number
.text:00427254 mov edx, [ebp+0] ; grab a dword off the stack
.text:00427257 add ebp, 4 ; pop the stack
.text:0042725A mov [edi+eax], edx ; store the dword in the scratch space

#01:x = [EIP-1] & 0x3C; y = [edi+x]; pushd y

.vmp0:0046B0EB and al, 3Ch; al = instruction number
.vmp0:0046B0EE mov edx, [edi+eax] ; grab a dword out of the scratch space
.vmp0:0046B0F1 sub ebp, 4 ; subtract 4 from the stack pointer
.vmp0:0046B0F4 mov [ebp+0], edx ; push the dword onto the stack

#02:x = popw, y = popw, z = x + y, pushw z, pushf

.text:004271FB mov ax, [ebp+0] ; pop a word off the stack
.text:004271FF sub ebp, 2
.text:00427202 add [ebp+4], ax ; add it to another word on the stack
.text:00427206 pushf
.text:00427207 pop dword ptr [ebp+0] ; push the flags

#03:x = [EIP++]; w = popw; [edi+x] = Byte(w)

.vmp0:0046B02A movzx eax, byte ptr [esi] ; read a byte from EIP
.vmp0:0046B02D mov dx, [ebp+0] ; pop a word off the stack
.vmp0:0046B031 inc esi ; EIP++
.vmp0:0046B032 add ebp, 2; adjust stack pointer
.vmp0:0046B035 mov [edi+eax], dl ; write a byte into the scratch area

#04:x = popd, y = popw, z = x >> y, pushd z, pushf

.vmp0:0046B095 mov eax, [ebp+0]; pop a dword off the stack
.vmp0:0046B098 mov cl, [ebp+4] ; pop a word off the stack
.vmp0:0046B09B sub ebp, 2
.vmp0:0046B09E shr eax, cl ; shr the dword by the word
.vmp0:0046B0A0 mov [ebp+4], eax; push the result
.vmp0:0046B0A3 pushf
.vmp0:0046B0A4 pop dword ptr [ebp+0] ; push the flags

#05:x = popd, pushd ss:[x]

.vmp0:0046B5F7 mov eax, [ebp+0]; pop a dword off the stack
.vmp0:0046B5FA mov eax, ss:[eax] ; read a dword from ss
.vmp0:0046B5FD mov [ebp+0], eax; push that dword

Compiler 1, X86 Virtualizer 0

Originally published on April 4th, 2008.  This post won Honorable Mention for Most Innovative Research at the Pwnie Awards 2008.

There are two types of virtual machine software protections:  A) the ones that convert x86 machine code into virtual machine bytecode and execute it at runtime; B) the ones that execute some arbitrary code in a virtual environment.  I've discussed the latter several times in the past, and by now there exists a wealth of literature on that variety.  But breaking the former kind remains an unsolved problem.

In my article I said "basically, reverse engineering a VM with the common tools is like reverse engineering a scripted installer without a script decompiler: it's repetitious, and the high-level details are obscured by the flood of low-level details".  The more I thought about this, the more I realized that the word "basically" is out of place:  virtualizing software protections are programming language interpreters, albeit for weird languages.

Consequently, an idea struck me:  what we want here is not an interpreter, but a compiler to compile the bytecode back into x86 machine code.  I spent a week coding one (~1000 lines) in OCaml to test this theory, and I'm able to report that, indeed, it works.  I chose ReWolf's x86 Virtualizer, a simple target that uses some of the same techniques as the heavy hitters in this area.  Here is a walkthrough of the analysis and recompilation of a small function with one basic block.  The compiler works equally well for arbitrarily-large functions, although that would make this posting unnecessarily long and complicated.

Step -2:  Protect something with the virtualizer.  In this case I just used ReWolf's sample executable itself.

.text:00401896 call ds:GetTickCount
.text:0040189C push eax
.text:0040189D call _srand
.text:004018A2 pop ecx
.text:004018A3 push 0
.text:004018A5 push offset DialogFunc
.text:004018AA push 0
.text:004018AC push 65h
.text:004018AE push [esp+10h+hInstance]
.text:004018B2 call ds:DialogBoxParamA
.text:004018B8 xor eax, eax
.text:004018BA retn 10h

Step -1:  Analyze the virtual machine.  Although this was not strictly necessary in this case because ReWolf provided source code, I decided to ignore it and reverse the VM manually, since you don't always have such niceties.


Step 0:  Break the polymorphism in the instruction set.  I made use of two remarkably ghetto hacks here, one of which may be considered elegant.  To avoid provoking any arms races I'll omit the details.

Step 1:  Disassemble the relevant region into VM bytecode.  In the process, construct a graph in which each vertex is an instruction, and the edges are the flows between them.

.VM:004131D0 db 0C2h, 0C9h, 0C0h, 0BDh, 14h, 0DFh, 63h, 9Ah, 86h, 5Eh, 50h, 30h, 0Bh
.VM:004131D0 db 0Ah, 0C0h, 0C7h, 0CEh, 5Eh, 44h, 0E1h, 0E0h, 0C7h, 0FCh, 0FDh, 12h
.VM:004131D0 db 10h, 50h, 0D8h, 0D2h, 0DBh, 0A6h, 3Dh, 34h, 0C9h, 12h, 0DEh, 0E5h, 4Bh
.VM:004131D0 db 2Ch, 2Eh, 6Eh, 23h, 21h, 27h, 0E2h, 0E5h, 0ECh, 99h, 14h, 13h, 0C2h
.VM:004131D0 db 0E5h, 0F9h, 0FDh, 0F4h, 38h, 14h, 0F7h, 0F0h, 0F9h, 0ABh, 79h, 6, 0D7h
.VM:004131D0 db 0F0h, 8Bh, 88h, 81h, 41h, 87h, 8Ch, 85h, 0F8h, 51h, 9Ah, 26h, 0DFh
.VM:004131D0 db 0CFh, 1Eh, 15h, 75h, 76h, 74h, 6Bh, 98h, 9Dh, 94h, 6Eh, 0Ch, 6Bh, 90h
.VM:004131D0 db 93h, 9Ah, 0Fh

becomes

vertexlist =
[{label = 84; instruction = VMExit 16l};
{label = 81; instruction = LiteralInstruction [|51; 192|]};
{label = 69; instruction = ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l)};
{label = 65; instruction = PushDereferencedTemp};
{label = 57; instruction = AddImmediateToTemp 20l};
{label = 52; instruction = AddRegisterToTemp Esp};
{label = 44; instruction = SetTemp 0l};
{label = 41; instruction = LiteralInstruction [|106; 101|]};
{label = 38; instruction = LiteralInstruction [|106; 0|]};
{label = 27; instruction = ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l)};
{label = 24; instruction = LiteralInstruction [|106; 0|]};
{label = 22; instruction = LiteralInstruction [|89|]};
{label = 14; instruction = X86Call 6471l};
{label = 12; instruction = LiteralInstruction [|80|]};
{label = 0; instruction = ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l)}];
edgelist =
[({contents = {label = 0}},{contents = {label = 12}});
({contents = {label = 12}}, {contents = {label = 14}});
({contents = {label = 14}}, {contents = {label = 22}});
({contents = {label = 22}}, {contents = {label = 24}});
(* Lots and lots of edges removed *)]

Step 2:  Form basic blocks within the instruction-level CFG.  The previous output becomes:

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|]; 
X86Call 6471l; 
LiteralInstruction [|89|];
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
SetTemp 0l; 
AddRegisterToTemp Esp; 
AddImmediateToTemp 20l;
PushDereferencedTemp;
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|]; VMExit 16l|]}];
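The block-formation step can be sketched as follows (illustrative Python, not the original OCaml): find the leaders, then absorb straight-line successors.

```python
def form_basic_blocks(vertices, edges):
    """Coalesce an instruction-level CFG into basic blocks.

    vertices: {label: instruction}; edges: list of (src, dst) labels.
    A leader is a vertex with zero or multiple predecessors, or
    whose sole predecessor branches.
    """
    succs, preds = {}, {}
    for s, d in edges:
        succs.setdefault(s, []).append(d)
        preds.setdefault(d, []).append(s)
    leaders = [v for v in vertices
               if len(preds.get(v, [])) != 1
               or len(succs.get(preds[v][0], [])) > 1]
    blocks = {}
    for leader in leaders:
        block, cur = [vertices[leader]], leader
        while len(succs.get(cur, [])) == 1:   # follow straight-line flow
            nxt = succs[cur][0]
            if len(preds.get(nxt, [])) != 1 or nxt in leaders:
                break
            block.append(vertices[nxt])
            cur = nxt
        blocks[leader] = block
    return blocks

# A straight-line chain collapses into a single block:
verts = {0: "call_tick", 12: "push_eax", 14: "call_srand"}
edges = [(0, 12), (12, 14)]
print(form_basic_blocks(verts, edges))
```

For the single-block example in this post, all fifteen instructions end up in one block labeled 0, as shown above.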

Step 3:  Optimize the code within the basic block.  The goal is to convert sequences of VM instructions into a new language more conducive to being compiled back into X86.  The optimizer is the most powerful component of my compiler:  it can remove obfuscation automatically simply as a side-effect of being an optimizer (not that ReWolf's has any, but others do), and employs no pattern matching.

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|]; 
X86Call 6471l;
LiteralInstruction [|89|];
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
SyntheticInstruction (Push, Plus (Constant 20l, Register Esp));
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|]; 
VMExit 16l|]}];

Step 4:  Recompile all virtual instructions into x86 machine language.

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|];
RelativeFixupInstruction ([|232; 0; 0; 0; 0|], 6471l, 1l);
LiteralInstruction [|89|]; 
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
LiteralInstruction [|255; 116; 36; 20|];
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|];
LiteralInstruction [|194; 16; 0|]|]}];

Step 5:  Stuff the original bytes back into the binary and perform the specified fixups.  If you can convert between hex and decimal in your head, you'll notice that the bytes above correspond to those below, modulo fixups.  For multi-basic-block functions, this step is harder, as you have to sequence the blocks and decide between short and long jumps.

.VM:004131D0 FF 15 28 A0 40 00 call ds:GetTickCount
.VM:004131D6 50                push eax
.VM:004131D7 E8 6B E7 FE FF    call loc_401947
.VM:004131DC 59                pop  ecx
.VM:004131DD 6A 00             push 0
.VM:004131DF 68 F0 16 40 00    push offset loc_4016F0
.VM:004131E4 6A 00             push 0
.VM:004131E6 6A 65             push 65h
.VM:004131E8 FF 74 24 14       push dword ptr [esp+14h]
.VM:004131EC FF 15 48 A1 40 00 call ds:DialogBoxParamA
.VM:004131F2 33 C0             xor eax, eax
.VM:004131F4 C2 10 00          retn 10h

Step 6:  Celebrate.  ReWolf's X86 Virtualizer was simple, and surely breaking the harder ones is, well, harder, but I believe that the general principles espoused here should be applicable to the others.


Here is the source code.

Industrial-Grade Binary-Only Profiling and Coverage

Originally published on February 16th, 2008 on OpenRCE

There are a few options for profiling or performing code-coverage analysis on a per-module binary level:

* Run traces (very slow and generate a huge amount of uninteresting data, but it works);
* MSR tracing (strengths and weaknesses remain to be seen, but seems fairly promising);
* BinNavi/CoverIt/PaiMei/presumably Inspector:  put a breakpoint on every function you found in a static disassembly (doesn't work in general; I explained why here)

There are more options rooted in academia, the most practical of which is dynamic binary instrumentation (DBI), the technology behind tools such as Valgrind and DynamoRIO.  The inner workings of this technology are very interesting, but they are rather involved and their precise technical details are beyond the scope of this entry.  Informally speaking, these tools disassemble a basic block, convert the instructions into an intermediate language like the ones you find inside of a compiler, and finally re-compile the IL with the "instrumentation" code baked directly into the new assembly language.  For more information, read the original Ph.D. thesis describing Valgrind and then read the source to libVEX, a component thereof.  Valgrind is slow and Linux-only, but DynamoRIO was specifically designed with speed in mind (hence the "Dynamo") and runs on Windows.

Here I present a DynamoRIO extension for code coverage and profiling.  It works on a function-level (although block-level support could be added easily -- the source weighs in at a measly 70 lines in 2kb, so if you want some other feature, just code it), and it can either be a profiler or a code coverage analyzer.  All it does is instrument the code such that each call instruction, direct or indirect, will write its source and target addresses into a file.  This data can then be used for either profiling or code coverage purposes:  simply discard all of the duplicates for the latter, and use the data as-is for the former.  This is just the back-end, but I imagine that this could be easily integrated into PaiMei's front end to provide an industrial-grade coverage and profiling tool.
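The post-processing described above is trivial.  A Python sketch, assuming a hypothetical log format of one "source target" pair of hex addresses per line:

```python
from collections import Counter

def parse_log(lines):
    """Parse 'source target' hex pairs written by the instrumentation."""
    return [tuple(int(x, 16) for x in line.split()) for line in lines]

def coverage(calls):
    """Code coverage: just the distinct call targets."""
    return sorted({target for _, target in calls})

def profile(calls):
    """Profiling: how many times each target was called."""
    return Counter(target for _, target in calls)

# Hypothetical log contents, as described in the text:
log = ["401000 402000", "401005 402000", "401010 403000"]
calls = parse_log(log)
print(coverage(calls))   # the two distinct targets
print(profile(calls))    # one target seen twice, the other once
```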

Strengths of DynamoRIO:
* speed (you might not even notice the slowdown);
* stability (there used to be a commercial security product based on this technology -- it is literally industrial grade);
* trivial to code extensions for (70 lines, 2kb for this simple yet powerful extension).

Weaknesses:
* definitely won't work with self-modifying code
* probably won't work with obfuscated or "self-protecting" code (there is a particular problem with so-called "pc-relative" addressing, such as call $ / pop ebp).

Studious readers may note that automatic indirect call resolution is exceptionally useful for C++ reverse engineering;  comment out the direct call resolution, recompile, write a quick IDC script to add the x-refs to the disassembly listing, and you've got a killer C++ RE tool.  Credit goes to spoonm for having and implementing this idea initially.

Note that in the six years since this was published, new binary instrumentation tools such as Intel's Pin have emerged which ameliorate some of the weaknesses of the tools described in this post from 2008.

Code Generation Quirk Involving Array Indexing

Originally published on February 13th, 2008 on OpenRCE.

In this post, we shall investigate some strange-looking code generated in the context of an array index.

.text:10002D49 mov eax, [esp+arg_0]
.text:10002D4D lea ecx, [eax-9C40h]
.text:10002D53 cmp ecx, 50h
.text:10002D56 ja short loc_10002D60
.text:10002D58 mov eax, dword ptr ds:(loc_1000EF5B+1)[eax*8]
.text:10002D5F retn
.text:10002D60
.text:10002D60 loc_10002D60:
.text:10002D60 lea edx, [eax-0A029h]
.text:10002D66 cmp edx, 9
.text:10002D69 ja short loc_10002D73
.text:10002D6B mov eax, dword ptr ds:loc_1000D344[eax*8]
.text:10002D72 retn

We don't find any arrays at the locations referenced on lines -D58 and -D6B (in fact, we find code), which is unusual:

; First target
.text:1000EF57 movzx eax, word ptr [esi+18h]
.text:1000EF5B loc_1000EF5B: ; DATA XREF: 10002D58
.text:1000EF5B add dword_10065280, eax
.text:1000EF61 xor eax, eax
.text:1000EF63 pop esi
.text:1000EF64 mov esp, ebp
.text:1000EF66 pop ebp

; Second target
.text:1000D342 mov esp, ebp
.text:1000D344 loc_1000D344: ; DATA XREF: 10002D6B
.text:1000D344 pop ebp

Looking closer at the code, the trick lies in the fact that the arrays are not being indexed starting at zero.

.text:10002D58 mov eax, dword ptr ds:(loc_1000EF5B+1)[eax*8] ; <- 0x9C40 <= eax < 0x9C90
.text:10002D6B mov eax, dword ptr ds:loc_1000D344[eax*8] ; <- 0xA029 <= eax < 0xA032

So the first array actually begins at 0x1000EF5B+1+0x9C40*8 == 0x1005D15C, and the second array begins at 0x1000D344+0x0A029*8 == 0x1005D48C.  What happened here is that the pointer expression has been simplified to conform to x86's instruction encoding:

[0x1005D15C + (eax - 0x9C40)*8] => [0x1005D15C - 0x4E200 + eax*8] => [0x1000EF5C + eax*8]
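This arithmetic is easy to verify mechanically; for instance, in Python:

```python
# First array: the base is chosen so that indices 0x9C40..0x9C90 land on it
base1 = 0x1000EF5B + 1 + 0x9C40 * 8
assert base1 == 0x1005D15C

# Second array, indexed from 0xA029
base2 = 0x1000D344 + 0xA029 * 8
assert base2 == 0x1005D48C

# The compiler's simplification: fold -0x9C40*8 into the displacement
assert 0x1005D15C - 0x9C40 * 8 == 0x1000EF5C   # == loc_1000EF5B + 1
print(hex(base1), hex(base2))
```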

This is pretty uncommon; I've only seen it a handful of times in my reversing endeavors over the years.

Compiler Optimizations Regarding Structures

Originally published on January 22nd, 2008 on OpenRCE

Here are some optimizations that I have seen MSVC apply to structure references.  I wish I could give you the real names for these optimizations, but I can't find them in any of my compilers textbooks.  I have a feeling that they're buried away somewhere inside of Randy Allen and Ken Kennedy's incredibly dense tome, "Optimizing Compilers for Modern Architectures".  If anybody knows the real names for these transformations, please speak up.

#1:  Let's say we are accessing multiple entries in a structure that's larger than 80h bytes.  Now as stated in the previous entry, each access to a member situated at offset >= 0x80 is going to require a dword displacement in the instruction encoding if we generate the "naive" code.  If we instead do:

lea esi, [esi+middle_of_structure_somewhere]
; ...
mov eax, [esi-(middle_of_structure_somewhere - member_offset1)]
mov ebx, [esi+(member_offset2 - middle_of_structure_somewhere)]

We can access more of the structure with the one-byte displacement encoding, provided those subtracted quantities fit in a signed byte.  The compiler chooses middle_of_structure_somewhere specifically to maximize the number of one-byte references.  This is the same idea behind the "frame pointer delta" stack-frame optimization.
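Choosing middle_of_structure_somewhere is a tiny optimization problem in itself: pick the bias that places as many member offsets as possible within signed-byte displacement range.  A greedy sketch (my own illustration, with invented offsets; MSVC's actual heuristic is unknown to me):

```python
def best_bias(offsets):
    """Pick the base-register bias that maximizes the number of
    member offsets reachable with a one-byte (disp8) displacement,
    i.e. offset - bias in [-128, 127].  As a simplification, only
    the member offsets themselves are tried as candidate biases."""
    def reachable(bias):
        return sum(-128 <= off - bias <= 127 for off in offsets)
    return max(offsets, key=reachable)

# Members scattered through a large structure (hypothetical offsets)
offsets = [0x00, 0x90, 0xA0, 0xF0, 0x110, 0x180]
bias = best_bias(offsets)
print(hex(bias), [hex(o - bias) for o in offsets])   # most now fit in disp8
```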

#2:  Let's say we have a loop that accesses two arrays of structures inside of another structure, one array beginning at +1234h, the other beginning at +2234h.  If we emit the "naive" code:

; ecx = loop induction variable
imul ebx, ecx, sizeof(structure1)
imul edx, ecx, sizeof(structure2)
; ...
mov eax, [esi+1234h+ebx+offset_of_member1]
mov edi, [esi+2234h+edx+offset_of_member2]

Then obviously both of these structure displacements are going to require a separate dword in the instruction encoding for 1234h+offset_of_member1 and 2234h+offset_of_member2.  If we instead do:

lea esi, [esi+1234h]
; ...
; ecx = loop induction variable
imul ebx, ecx, sizeof(structure1)
imul edx, ecx, sizeof(structure2)
; ...
mov eax, [esi+ebx+offset_of_member1]
mov edi, [esi+1000h+edx+offset_of_member2]

Then if offset_of_member1 fits in a byte, it's only going to require a byte in the instruction encoding, thus saving three bytes per reference to the first structure (we can combine this with the previous optimization to place esi such that the number of one-byte references is maximized).  Alternatively, if more members are accessed in the second structure than in the first, we'll see:

lea esi, [esi+2234h]
; ...
; ecx = loop induction variable
imul ebx, ecx, sizeof(structure1)
imul edx, ecx, sizeof(structure2)
; ...
mov eax, [esi+ebx+offset_of_member1-1000h]
mov edi, [esi+edx+offset_of_member2]

Once again, the first optimization can also be applied here to choose the optimal placement for esi that maximizes the number of single-byte references.  The multiplications given in the second optimization can also be optimized away into additions.
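That closing remark about turning the multiplications into additions is classic strength reduction: the per-iteration imul against the induction variable is replaced by a running sum.  A hedged Python sketch of the idea (the names and structure sizes are hypothetical, and this models the displacements only, not the real codegen):

```python
SIZEOF_S1 = 0x28  # hypothetical sizeof(structure1)
SIZEOF_S2 = 0x30  # hypothetical sizeof(structure2)

def displacements_naive(n):
    # one multiply per array per iteration (the imul reg, ecx, size form)
    return [(i * SIZEOF_S1, i * SIZEOF_S2) for i in range(n)]

def displacements_reduced(n):
    # strength-reduced: a running sum replaces each multiply with an add
    out, d1, d2 = [], 0, 0
    for _ in range(n):
        out.append((d1, d2))
        d1 += SIZEOF_S1
        d2 += SIZEOF_S2
    return out

# both forms produce identical displacement sequences:
assert displacements_naive(8) == displacements_reduced(8)
```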

Byte Search in Structure Recovery

Originally published on January 21st, 2008 on OpenRCE

Here's an old and simple trick that I use extensively while recovering large structures in C++ reversing.  Briefly, the challenges in structure recovery are to determine:

  1. The size of the structure;
  2. The inheritance hierarchy that relates this structure to others;
  3. The location of the members within the structure;
  4. The data types of the members;
  5. All of the locations in the code where the structure is used;
  6. The holistic picture:  the overall purpose of the structure and the contributions of each data member to that.

This entry is concerned with point #5.  Let's assume that we know (#3) where a particular member within a structure is situated.  In order to figure out (#4) its data type and (#6) its purpose, we should inspect (#5) the locations at which this data member is used.  We might get lucky; maybe we'll find something like this:

mov eax, [esi+Structure.field_XYZ]
push eax
push offset fmtstr ; "%s:Loading into memory for emulation\n"
call LoggingFunction
add esp, 8

From this we can infer both the data type (char *) and functionality (it's a pointer to the name of the file that is about to be emulated), and draw a conclusion about the overall structure (that it's probably related to emulation).  Perhaps we won't get as lucky as this scenario, but maybe a more subtle clue is revealed by one of the references.  So, how do we find other locations at which this structure member is being used?

The obvious answer would be to text-search for the phrase "[reg32+0XYZh]", but this method has a few drawbacks:

A)  It's slow;
B)  It relies upon the disassembler properly distinguishing code from data, a problem that is impossible to solve in general (it is equivalent to the halting problem, thanks to indirect addressing, which is the bread and butter of C++'s implementation of polymorphism via function pointers);
C)  Finding the string above just tells us that field_XYZ in *some* structure is being used, not necessarily our particular structure of interest.

Point C is critical and bears closer inspection; how can we be sure that the results we are finding actually refer to the structure that interests us?  Let's examine the situation for some specific values of XYZ:

Q:  How many structures contain a member defined at XYZ = +0?
A:  All of them.  Therefore if we were to search for [reg32], we could make no guarantees about which structure is actually being used (or even that a structure is being used, period).

Q:  How many structures contain a member defined at XYZ = +4?
A:  Most of them.  The same comment from above applies.

Q:  How many structures contain a member defined at XYZ = +40?
A:  Few of them.  In my experience, the vast majority of structures in a program are 0x40 bytes or smaller, and proportionally very few are larger than that.

Q:  How many structures contain a member defined at XYZ = +X, where X >= 0x80?  X >= 0x100?  X >= 0x1000?  X >= 0x10000?  X >= 0x80 and X is not dword-boundary-aligned?  X >= 0x80 and X is not a multiple of a high power of two?
A:  The larger the structure, the better the chance that the locations of its data members are unique, or if not unique, then at least that the structures sharing them were derived from a common base class.

The first lesson is that the higher the offset within the structure, the fewer structures are going to have data members defined at that offset, which means that offset searching begins to become feasible for these high-offset data members.  Point C from above is addressed.

On point B, let's briefly look at some characteristics of instruction encoding on x86.  Below are some typical structure references:

8B 16              mov edx, [esi]        ; notice that the +0 is not present in the encoding
66 89 42 36        mov [edx+36h], ax     ; notice that the +36h is present as a byte in the encoding
8B 8E AC 5F 00 00  mov ecx, [esi+5FACh]  ; notice that the +5FACh is present as a dword in the encoding

Displacements off of a register that fit into a single signed byte, e.g. [esi-80h] ... [esi+7Fh], are represented with a single signed byte in the instruction's encoding, e.g. the 36h from the above.  But searching for a byte is no good; a single byte could appear in any context.  Displacements off of a register that are outside of this small window, e.g. [esi+80h], are represented with a dword in the instruction's encoding, e.g. the AC 5F 00 00 from the above.  Therefore, any time one of these high-offset structure members is accessed directly, we're going to see an entire dword in the instruction stream that corresponds to the offset of the structure member.  Searching for an entire dword gives much more precise results than searching for a byte.
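Since a high member offset shows up verbatim as a little-endian dword in the instruction stream, the byte pattern to search for is simply the offset packed as a dword.  A hedged Python sketch of that rule (the function is my own; disp8 offsets are deliberately rejected, since a single byte matches everywhere):

```python
import struct

def search_pattern(offset):
    """Return the little-endian dword byte pattern to search for when
    hunting direct references to a structure member at `offset`."""
    if -0x80 <= offset <= 0x7F:
        # encoded as a single signed byte: too short to search for reliably
        raise ValueError("disp8 offsets produce too many false positives")
    return struct.pack("<i", offset)

print(search_pattern(0x5FAC).hex(" "))  # ac 5f 00 00
```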

Now all of the machinery is in place for the real point of this entry.  Suppose we can't figure out field_5FAC's data type or functionality, and we would like to see other references to that member to see if they provide any clues.  We could text search for the regular expression [.*+5FACh], and we would be reasonably sure that we were finding references to our structure of interest, or at least structures from the same family, but it would be slow, and would only find references that were defined as code.

This is where IDA's "binary search" feature, alt-B or Search->Sequence of bytes..., comes in handy.  Enter AC 5F 00 00 into the window.  IDA instantaneously brings up a window with sixty-nine lines of code, all of which have the form "mov reg32, [reg32+5FACh]" or "mov [reg32+5FACh], reg32".  There is one additional result of the form "db 0ACh", which, when the surrounding bytes are turned into code, is revealed to be a structure reference of the aforementioned variety.  None of the results are false positives.
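The same search is easy to reproduce outside of IDA, for instance over a raw dump of a code section.  A minimal Python sketch (the sample blob is two hand-assembled mov reg32, [esi+5FACh] instructions separated by a nop; the function name and blob are my own illustration, and it assumes a non-negative offset):

```python
def find_refs(blob, offset):
    """Return every position at which `offset` appears as a little-endian
    dword in `blob`, i.e. every candidate disp32 structure reference."""
    pattern = offset.to_bytes(4, "little")
    hits, i = [], blob.find(pattern)
    while i != -1:
        hits.append(i)
        i = blob.find(pattern, i + 1)
    return hits

# 8B 8E AC 5F 00 00   mov ecx, [esi+5FACh]
# 90                  nop
# 8B 96 AC 5F 00 00   mov edx, [esi+5FACh]
blob = bytes.fromhex("8b 8e ac 5f 00 00 90 8b 96 ac 5f 00 00")
print(find_refs(blob, 0x5FAC))  # [2, 9]
```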

The point of this blog entry was that the larger a structure becomes, the more "unique" the addresses of its members become, and that, due to the instruction encoding on x86, we can find all direct references to the high-offset structure members quickly, easily, and with few to no false positives using IDA's binary search feature.