Sunday, October 30, 2011

Disassembly Algorithms


As with any problem in computer science, there are many ways to solve the problem of disassembling executable code. I was wondering how IDA Pro disassembles executables. In general, disassembly can be performed as follows:

  1. We take the bytes one instruction at a time and look the instruction up in a lookup table we have prebuilt(which maps raw byte values to opcode mneumonics). Keep in mind that different architectures can have different instruction lengths and some instruction sets could have variable instruction lengths. In the case of variable instruction lengths, we have to use "context clues" such as reading bytes before/after this instruction to infer what this instruction could be.
  2. The lookup in the table will yield the opcode mneumonic for this instruction.
  3. While we still have more data, go back to step 1.

The actual disassembly process presented above might not be very complex, but the question arises "How do we traverse the executable?" As I presented in a previous blogpost, there are multiple sections within an executable, so one may be unsure about where in the executable to even start disassembling. According to "The IDA Pro Book: The Unofficial Guide" by Chris Eagle, there are 2 main executable traversal algorithms.

Linear Sweep Disassembly

In this algorithm, we start at the beginning of a section, and keep track of the length of each instruction we disassemble. Naturally, the next instruction's address can be computed as follow follows:

Next instruction address = Current instruction address + Current instruction size

As the name implies, this algorithm scans the executable linearly(it does not refer to the Asymptotic Complexity of this algorithm). When call or jmp instructions are encountered, they are decoded as regular instructions, and given no special consideration. Think of this algorithm as instruction-by-instruction decoding. Eagle also mentions that this algorithm is used by popular disassembly engines such as gdb, WinDbg and objdump.
  • Advantages: all the byes in the section is guaranteed to be decoded
  • Disadvantages: ALL the bytes in the section is guaranteed to be decoded(including non-executable data)

Recursive Descent Disassembly

In this algorithm, we use the basic approach in Linear Sweep Disassembly, except we are a little smarter about branching instructions. We decode each instruction that we see, but if we see a jmp, call, or other type of control-flow-manipulating instruction, we also add the target address operand of the instruction to a list for further processing. Once we are finished linearly processing all the instructions in the executable, we go back and repeat the same process starting at each address in the list we created which contains the target addresses of control flow manipulation instructions we passed. One noteworthy point is that addresses that are only known at execution time(such as an address stored in EAX) rather than known those hardcoded into the binary (and lend themselves to static analysis) cannot be added to the list for further processing, because the address is not known. The word "Recursive" in the name refers to the fact that we take both branches of eligible branching statements. If all possible execution paths are modeled as a tree, this would effectively be a Breadth-First-Traversal of the execution tree. Think about this algorithm as one that only processes bytes that the CPU processes. IDA Pro uses this algorithm.
  • Advantages: only disassembles bytes that the CPU encounters, and not extraneous bytes
  • Disadvantages: relatively complex, does not always disassemble the whole executable if there is code which is not called explicitly based on its address.