As with any problem in
computer science, there are many ways to solve the problem of disassembling
executable code. I was wondering how IDA Pro disassembles executables. In
general, disassembly can be performed as follows:
- We take the bytes one instruction at a time and look the instruction up in a lookup table we have prebuilt(which maps raw byte values to opcode mneumonics). Keep in mind that different architectures can have different instruction lengths and some instruction sets could have variable instruction lengths. In the case of variable instruction lengths, we have to use "context clues" such as reading bytes before/after this instruction to infer what this instruction could be.
- The lookup in the table will yield the opcode mneumonic for this instruction.
- While we still have more data, go back to step 1.
The actual disassembly
process presented above might not be very complex, but the question arises "How
do we traverse the executable?" As I presented in a previous blogpost,
there are multiple sections within an executable, so one may be unsure about
where in the executable to even start disassembling. According to "The IDA
Pro Book: The Unofficial Guide" by Chris Eagle, there are 2 main executable
traversal algorithms.
Linear Sweep
Disassembly
In this algorithm, we
start at the beginning of a section, and keep track of the length of each
instruction we disassemble. Naturally, the next instruction's address can be
computed as follow follows:
Next instruction
address = Current instruction address + Current instruction size
As the name implies,
this algorithm scans the executable linearly(it does not refer to the Asymptotic
Complexity of this algorithm). When call or jmp instructions are encountered,
they are decoded as regular instructions, and given no special consideration. Think
of this algorithm as instruction-by-instruction decoding. Eagle also mentions
that this algorithm is used by popular disassembly engines such as gdb, WinDbg
and objdump.
- Advantages: all the byes in the section is guaranteed to be decoded
- Disadvantages: ALL the bytes in the section is guaranteed to be decoded(including non-executable data)
Recursive Descent
Disassembly
In this algorithm, we
use the basic approach in Linear Sweep Disassembly, except we are a little
smarter about branching instructions. We decode each instruction that we see,
but if we see a jmp, call, or other type of control-flow-manipulating
instruction, we also add the target address operand of the instruction to a list
for further processing. Once we are finished linearly processing all the
instructions in the executable, we go back and repeat the same process starting
at each address in the list we created which contains the target addresses of control
flow manipulation instructions we passed. One noteworthy point is that
addresses that are only known at execution time(such as an address stored in
EAX) rather than known those hardcoded into the binary (and lend themselves to
static analysis) cannot be added to the list for further processing, because
the address is not known. The word "Recursive" in the name refers to
the fact that we take both branches of eligible branching statements. If all
possible execution paths are modeled as a tree, this would effectively be a
Breadth-First-Traversal of the execution tree. Think about this algorithm as
one that only processes bytes that the CPU processes. IDA Pro uses this
algorithm.
- Advantages: only disassembles bytes that the CPU encounters, and not extraneous bytes
- Disadvantages: relatively complex, does not always disassemble the whole executable if there is code which is not called explicitly based on its address.