Recrutement
Si vous êtes intéressés pour bosser sur des sujets sympas tout en restant loin de Paris, consultez nos offres d'emploi et envoyez nous votre CV à rh@amossys.fr.

Exploring Hex-Rays microcode

During the decompilation process, the Hex-rays decompiler uses Microcode as an intermediate language between the ASM and Pseudocode. As this internal language has yet to be explored in detail, we thought it would be an interesting challenge to dig into it and see how it could help static code retro-engineering. Here are some discussions about the general organization of microcode, followed by the possible applications regarding data flow information.

Methodology

We explored microcode via the IDA Python API. But in the beginning of our work (April 2019), the Hex-Rays Python API to handle microcode was not released yet.

This is why we started our investigations using ctypes to handle C++ structures1. When the microcode API was finally released in July 2019 (ida_hexrays, 7.3 version), we replaced all calls to ctypes libraries by the corresponding Python method. As the Python and C++ API share the same structure, it was a pretty straightforward operation.

As the official documentation for the ida_hexrays Python API is not up-to-date with the new version, we made extensive use of the C++ Hex-Rays documentation (thankfully, both API are overwhelmingly similar).

Introduction to microcode

Microcode is the intermediate language used by the Hex-Rays decompiler.

Its main goal is to turn the result of disassembly into a simpler and processor-independent representation of source code. (For a comprehensive overview of microcode syntax, one might refer to Guilfanov's conferences on microcode2.)

Microcode is generated then optimized during the first steps of decompilation.

At first sight it looks like RISC code:

    mov ebx.4, eax.4 ; 4014FB u=ebx.4 d=eax.4
  • Each microinstruction has one opcode and 3 operands (one or more may be missing depending on the instruction). Notice that the destination operand is at the far right:
    opcode left, right, destination
  • The '4014FB' part is the address of the instruction

The last part is the most interesting for us:

  • the 'u' and 'd' elements represent use- and def-list of instructions. We will discuss these elements on the last part of this post (Dataflow information).

When generated, microcode is very verbose, but a signification portion of microinstructions gets deleted during successive optimization phases.

There are 7 steps of microcode optimization. As the first steps produce verbose codes and incomplete data flow information, we focused on the last two steps: Global Optimization 3 and Local Variable.

The reason we used mainly the 6th optimization level (Global Optimization 3) and not the highest one (Local Variable) is because the Local Variable optimization gets rid of the distinction between stack and register.

Tree-like structure

Generating microcode for a function results in a tree-like structure called mbl_array_t. As its name suggests, mbl_array_t is an array of microblocks, linked to each other according to the control flow graph.

Microblocks are groups of micro-instructions, with the first and last microblocks of a function being always empty.

Tree-like structure of microcode
Figure 1: Tree-like structure of microcode
One aspect worth mentioning is that microinstructions might be nested inside each other. More specifically, one micro-operand can be the result of another instruction, like in the example below:

mov call $"j_?divide"<cdecl:"int a" %j.4, "int b" #7.4>.4, %r.4

Instructions directly linked to a microblock are called top-level instructions, the other are subinstructions.

Micro-Operands

There are 15 different types of micro-operands. Operand types are valuable information, for they allow to easily distinguish between registers, stack variables, memory addresses or numeric constants.

In practice, each operand owns a 't' attribute stating its type, and each operand type holds different kinds of additional information.

Nom

mop.t

Operand type

Useful data

Corresponding attribute

z

0x0

None

Empty operand

r

0x1

Register

Register number

int mop.r

n

0x2

Number constant

Constant value

long mop.nnn.value

t

0x3

String constant

String

String mop.cstr

d

0x4

Result of another instruction

Subinstruction

minsn_t mop.d

S

0x5

Local stack variable

variable offset

long mop.s.off

v

0x6

Global variable

variable address

long mop.g

b

0x7

Microbloc (jump destination)

Microbloc number

int mop.b

f

0x8

List of arguments

List of arguments

List<mop_t> mop.f. args

l

0x9

Local variable[1]

Variable offset

long mop.l.off

a

0xA

Address of operand (l, v, S or r)

Operand

mop_t mop.a

h

0xB

Helper function

Helper (char_p)

String mop.helper

c

0xC

microcases[2]

Not studied

fn

0xD

Floating point constant

Not studied

p

0xE

Operand pair

low and high operand

mop_t mop.pair.lop

mop_t mop.pair.hop

sc

0xF

Scattered

Not studied

Navigating through microcode is appreciably easy. Many features are implemented, such as visitor patterns and find_XX() methods.

Below is a set of useful functions for navigation, with use examples for the not self-explanatory ones :

Table 1 : ida_hexrays API for navigating microcode

Operation

Methode / Attribut

Generate microcode of function

mbl_array_t ida_hexrays.gen_microcode( mba_ranges_t mbr, hexrays_failure_t hf, mlist_t retlist = None, int VRFLAGS = 0, int reqmat = MMAT_GLBOPT3 )

Generate microcode for function at address 'ea'

mbr = ida_hexrays.mba_ranges_t()

fn = idaapi.get_func(ea)

eea = fn.endEA

sea = fn.startEA

mbr.ranges.push_back(ida_range.range_t(sea, eea))

hf = ida_hexrays.hexrays_failure_t()

f = ida_hexrays.DECOMP_WARNINGS

mmat = 7

mba = ida_hexrays.gen_microcode(mbr, hf, None, f, mmat)

mbl_array_t

Object information

Print

void print(vd_printer_t vp)

Print mba

vp = ida_hexrays.vd_printer_t()

mba.print_t(vp)

Entry address

long entry_ea

Number of blocks

int qty

Navigation

First block

mblock_t blocks

Get block i

mblock_t get_mblock(i)

Visit all instructions (including subinstructions)

int for_all_insns(minsn_visitor_t mv)

Visit all top level instructions

int for_all_topinsns(minsn_visitor_t mv)

Visit all operands

int for_all_ops(mop_visitor_t mv)

mblock_t

Object information

Print

void print(vd_printer_t vp)

Print mblock

vp = ida_hexrays.vd_printer_t()

mblock.print_t(vp)

Bloc number

int serial

Navigation

mbl_array_t parent

mbl_array_t mba

Previous bloc (natural order)

mblock_t prevb

Next bloc (natural order)

mblock_t nextb

List of predecessors in CFG

List<mblock_t> predset

List of successor in CFG

List<mblock_t> succset

First instruction

minsn_t head

Last instruction

minsn_t tail

minsn_t

Object Information

Print

String dstr()

Address

int ea

Opcode

int opcode

Navigation

Previous instruction in bloc

minsn_t prev

Next instruction in bloc

minsn_t next

left operand

mop_t l

right operand

mop_t r

destination operand

mop_t d

Visit instruction and subinstructions

int for_all_insns(minsn_visitor_t)

Example of visitor implementation

class minsn_visitor(ida_hexrays.minsn_visitor_t):

def __init__(self):

ida_hexrays.minsn_visitor_t.__init__(self)

 

def visit_minsn(self):

print self.blk.serial # show block number

print self.topins.dstr() # show top-instruction

print curins.dstr() # show current instruction

return 0

 

visitor = minsn_visitor()

mba.for_all_insns(visitor)

Find a (sub)instruction with specific opcode

bool contains_opcode(int opcode)
minsn_t find_opcode(int opcode)

Find a call (sub)instruction

bool contains_call(bool with_helpers=False)
minsn_t find_call(bool with_helpers=False)

mop_t

Object Information

Print

String dstr()

Operand type

int t

Data flow information

Two types of information are available:

• Location: which memory locations are being read or modified by one instruction. This information is stored in microlists (mlist_t class).

• Possible values of registers or data. This information is represented by value ranges (valrng_t class)

It is important to understand that these microlist and value range objects are inherently different: although they both represent sets of values and therefore share common methods, they stand for two entirely distinct concepts.

Microlists

Microlists represent memory locations used or modified by one element of microcode. They can be calculated for a microblock, a microinstruction or even a single micro-operand.

It is a composite class which allows common representation for register, stack and global memory locations.

microlist internals
Figure 2: microlist internals
- Attribute reg (rlist_t class) is a list of virtual microregister (integer). These microregisters are simply identified with a number, and they are mapped to physical registers by the decompiler.

- Attribute mem (ivlset_t class) is a list of memory ranges. Each range (ivl_t class) is characterized by one offset and one size.

Value ranges

Value ranges represent all possible values of an operand. They can refer to an entire microblock or a microinstruction. Internal structure is not publicly available.

Obtaining dataflow information

All data flow information is obtainable through microblocks, even if it refers to microinstructions or operands.

Constant values MAYMUST and VRFLAGS are defined in the second table.

Table 2 : Methods and attributes for exploring data flow

Information

Methode / Attribut

mblock_t

 

Memory zones

In block

Data that must be used by the block

mlist_t mustbuse

Data that may be used by the block

mlist_t maybuse

Data that must be defined by the block

mlist_t mustbdef

 

Data that may be defined by the block

mlist_t maybdef

In one instruction of the block

Used data

mlist_t build_use_list(minsn_t minsn, int MAYMUST)

Defined data

mlist_t build_def_list(minsn_t minsn, int MAYMUST)

In one operand of the block

Used data (return value in mlist)

void append_use_list(mlist_t mlist, mop_t mop, int MAYMUST)

Defined data (return value in mlist)

void append_def_list(mlist_t mlist, mop_t mop, int MAYMUST)

Memory zones values

Find possible values for an instruction

bool get_valranges(valrng_t valrng, vivl_t vivl, minsn_t minsn, int VRFLAGS)

Table 3 : Constants values for data flow manipulation

Constant

Defined values

 

MAYMUST

MUST_ACCESS = 0

MAY_ACCESS = 1

VRFLAGS

VR_AT_START= 0 get value ranges before the instruction or at the block start

VR_AT_END= 1 get value ranges after the instruction or at the block end

get_valranges() in practice

In theory, microblock's get_valranges() method seems like an exciting tool to work with: possible values of one operand are indeed a highly valuable information for static analysis.

Needless to say, we did not expect this function to operate without deficiency: Most of the time, information on memory values is simply not available through static analysis.

We mainly focused our investigation on retrieving value ranges for call arguments, but the method could also apply for other types of instructions.

The goal was to clearly define the limitations and possibilities of the get_valranges() method.

Method arguments

The arguments used by the get_valranges() method requires some explanations:

bool get_valranges(valrng_t valrng, vivl_t vivl, minsn_t minsn, int VRFLAGS)
  • The return Boolean indicates whether the operation went well

  • The calculated value ranges are added to the valrng argument.

Creating an empty valrng_t instance is easy:

valrng = ida_hexrays.valrng_t()
  • The vivl argument represents the memory location we want to know value ranges of.

The vivl_t class represents memory locations (one memory offset and one size). Though it looks very similar to the ivl_t class, one is used for mlist_t and the other for valrng_t, and no direct conversion is possible from one to the other.

The easiest way of instantiating vivl_t is using a micro-operand mop:

vivl = ida_hexrays.vivl_t(mop)
  • minsn argument can be 'None' for considering the whole microblock. Obviously, minsn must belong to the involved microblock.

  • VRFLAGS values are defined above (table 3): they allow calculating the value ranges either just before or right after the instruction.

Operand types

This can seem counter-intuitive, but the get_valranges() method is only made for actual memory location (r, S or v-type operands) : The attempt of calculating value ranges for numeric constant (n-type operands) or local variable (l-type operands) will result in a RuntimeError or an empty valrng_t object.

This has some note-worthy implications:

  • First, get_valranges() is useful only for microcode below 'LOCAL VARIABLES' maturity level (level 7), i.e. without local variable.

  • And second, when trying to evaluate an operand, you need to take into consideration the cases where said operand is a numeric constant, which value is obviously known without the need of get_valranges().

Subinstruction operands

It is perfectly possible to retrieve value ranges of subinstruction operands. However, as mentioned above, the minsn argument of get_valranges() must belong to the microblock, i.e. it must be one of microblock's top instruction.

More specifically, using a subinstruction as argument in get_valranges() will raise a Runtime Error during execution.

The correct way of proceeding is to use a vivl_t corresponding to the desired operands, and then the top-level operand of its instruction.

Calculation pessimism

One thing we realized about value ranges propagation between instructions is that it always considers the worst-case scenario: in practical terms, if one instruction may modify the value of a memory location in an unpredictable way, then all previously calculated value ranges are considered irrelevant after this instruction.

This behavior can be particularly disappointing when working with call instructions or indirect addressing, as they may access and modify any memory location.

To work around this feature, we implemented our own way of propagating value ranges, ignoring may-be-defined elements of instruction (see code below). Our function works as expected, but it only works within a microblock, as we did not find a satisfactory way of propagating results between blocks.

MUST, VR_AT_START = 0
MAY, VR_AT_END = 1
MOP_N = 2

def get_valranges(maymust, mblock, minsn, mop):

    # get valranges of mop before instruction minsn
    valrng = ida_hexrays.valrng_t()

    if (mop.t == MOP_N): # MOP_N = Numeric constant Operand
        val.set_eq(mop.nnn.value)
    else :
        vivl = ida_hexrays.vivl_t(mop)
        if maymust == MAY :
            mblock.get_valranges(valrng, vivl, minsn, VR_AT_START)
        elif maymust == MUST :
            uselist = ida_hexrays.mlist_t()
            mblock.append_use_list(uselist, mop, maymust)
            mblock.get_valranges(valrng, vivl, None, VR_AT_START) # valrange at block entry
            # if mop value is updated inside the microbloc, valrng is updated as well :
            def_insn = find_last_redefinition(uselist, mblock, mblock.head, minsn, maymust)
            if def_insn is not None :
                mblock.get_valranges(valrng, vivl, def_insn, VR_AT_END)
    return val


def find_last_redefinition(mlist, mblock, minsn_start, minsn_end, maymust):

    # find last instruction redefining one element of mlist
    # between minsn_start and minsn_end
    minsn_start = mblock.find_redefinition(mlist, minsn_start, minsn_end, maymust)
    minsn = minsn_start
    while minsn_start is not None :
        minsn = minsn_start
        minsn_start = mblock.find_redefinition(mlist, minsn_start.next, minsn_end, maymust)
    return minsn

Conclusion

In comparison with pseudocode, microcode is easier to navigate through using the ida_hexrays API, especially when considering retrieving call arguments.

Use- and def-list features are working exactly as expected and could be useful for some applications.

In regard to the value ranges feature, the most severe restriction of use concerns value ranges propagation between microblocks. As it is, it does not reveal much valuable information.

On the other hand, value ranges calculations within a block give satisfactory outcomes, although it might have limited practical applications.


  1. A huge thanks to this github project : https://github.com/NeatMonster/MCExplorer 

  2. https://i.blackhat.com/us-18/Thu-August-9/us-18-Guilfanov-Decompiler-Internals-Microcode-wp.pdf 

  3. At 'LOCAL VARIABLES" optimization level, register(r) and local stack variable (S) operands are being replaced by local variable (l) 

  4. c operands are used for representing 'switch' idioms 

  5. At 'LOCAL VARIABLES" optimization level, register(r) and local stack variable (S) operands are being replaced by local variable (l) 

  6. c operands are used for representing 'switch' idioms