Description
Binary referenced in ticket: happy star paints carefully.
During analysis we identify pointers to data in functions and create data variables. This step visits every MLIL instruction in a function, and as such we expect each pointer to appear as a simple expression. This poses an issue for specific cases where the constant pointer is constructed piecemeal, or relative to some other value.
MLIL:
21 @ 14021499e rax_2 = [(rcx_3 + &__dos_header) + 0x8e6e60].q
22 @ 1402149a6 rcx_4 = [(rcx_3 + &__dos_header) + 0x8e79f0].q
MLIL (Show opcodes):
21 @ 14021499e (MLIL_SET_VAR.q rax_2 = (MLIL_LOAD.q [(MLIL_ADD.q (MLIL_ADD.q (MLIL_VAR.q rcx_3) + (MLIL_CONST_PTR.q &__dos_header)) + (MLIL_CONST.q 0x8e6e60))].q))
22 @ 1402149a6 (MLIL_SET_VAR.q rcx_4 = (MLIL_LOAD.q [(MLIL_ADD.q (MLIL_ADD.q (MLIL_VAR.q rcx_3) + (MLIL_CONST_PTR.q &__dos_header)) + (MLIL_CONST.q 0x8e79f0))].q))
HLIL:
14021499e int64_t rax_2 = (&data_1408e6e60)[rbx]
1402149a6 int64_t rcx_4 = *((rbx << 3) + 0x1408e79f0) // The LHS is folded in from another expr
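To make the "piecemeal" construction concrete, here is a minimal sketch of the kind of folding that would recover the pointer from the MLIL above. It is a toy expression tree, not Binary Ninja's API: `Var`, `Const`, `Add` and `split_constant` are hypothetical names, and `IMAGE_BASE` is the value of `&__dos_header` inferred from the HLIL constant `0x1408e79f0`.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed value of &__dos_header, inferred from 0x1408e79f0 - 0x8e79f0.
IMAGE_BASE = 0x140000000


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Var, Const, Add]


def split_constant(expr: Expr) -> Tuple[int, List[Expr]]:
    """Walk a tree of additions and return (sum of constant leaves, other leaves)."""
    if isinstance(expr, Const):
        return expr.value, []
    if isinstance(expr, Add):
        left_const, left_rest = split_constant(expr.left)
        right_const, right_rest = split_constant(expr.right)
        return left_const + right_const, left_rest + right_rest
    # Anything that is not a constant or an addition is kept as an opaque term.
    return 0, [expr]


# (rcx_3 + &__dos_header) + 0x8e79f0, i.e. the address loaded at 1402149a6.
load_22 = Add(Add(Var("rcx_3"), Const(IMAGE_BASE)), Const(0x8E79F0))
base, terms = split_constant(load_22)
print(hex(base), terms)
```

With the two constants merged, the load address becomes a single base plus `rcx_3`, and that base is the address we would want to consider for a data variable.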
Your next question might be: why did the data variable referenced at 14021499e get constructed? It has to do with the way pointer sweep operates. At 1408e6e60 there was a value pointing at a function, which pointer sweep uses as a strong indicator that the address (1408e6e60) holds a pointer.
1408e6e60 void* data_1408e6e60 = sub_1402145f0
At 0x1408e79f0 we are not so lucky:

And if we manually make a data variable here:
1408e79f0 int64_t data_1408e79f0 = 0x140ffa900
Our current pointer sweep is conservative in the sense that we track these referrers (1408e79f0) and wait until 0x140ffa900 is discovered. Then, if 0x140ffa900 becomes a data variable, we backtrack and construct data variables at the locations pointing to it, such as 1408e79f0. This means that if 0x140ffa900 never gets identified as a data variable during pointer sweep, we will miss 1408e79f0 as well (assuming no data variable existed there prior to pointer sweep).
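The deferral described above can be modelled roughly as follows. This is a toy model under stated assumptions, not the actual pointer sweep implementation: `conservative_sweep`, `candidates` and `known_targets` are hypothetical names, and "known" stands in for anything the sweep already trusts, such as a function start.

```python
from collections import defaultdict
from typing import Dict, Set


def conservative_sweep(candidates: Dict[int, int], known_targets: Set[int]) -> Set[int]:
    """Return the addresses at which a data variable would be created.

    candidates maps an address to the pointer-sized value stored there.
    A candidate is accepted right away if its value is a known target;
    otherwise it waits until its target is itself accepted.
    """
    pending = defaultdict(list)  # target address -> referrers waiting on it
    worklist = []
    for address, target in candidates.items():
        if target in known_targets:
            worklist.append(address)
        else:
            pending[target].append(address)

    created: Set[int] = set()
    while worklist:
        address = worklist.pop()
        if address in created:
            continue
        created.add(address)
        # Backtrack: referrers that were waiting on this address are now accepted.
        worklist.extend(pending.pop(address, []))
    return created


# The two table slots from this ticket; only sub_1402145f0 is known up front.
slots = {0x1408E6E60: 0x1402145F0, 0x1408E79F0: 0x140FFA900}
print([hex(a) for a in sorted(conservative_sweep(slots, {0x1402145F0}))])
```

In this model 1408e79f0 is only picked up if something else causes 0x140ffa900 to be accepted first, which mirrors the miss described in this ticket.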
So what can we do? We can solve the issue in two ways: either by identifying the data variable during function analysis (likely by simplifying the expression when we check for data variable references), or by improving pointer sweep for non-relocatable binaries, likely through some "pointer table" sweep.
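As a rough idea of what a "pointer table" sweep could look like, here is a minimal sketch that scans raw section bytes for runs of consecutive 8-byte little-endian values that all land inside the image. Everything here is an assumption made for illustration: the function name, the `min_run` threshold, the image bounds, and the sample table contents other than 0x140ffa900 are made up and not taken from the binary.

```python
import struct
from typing import List, Tuple


def find_pointer_tables(
    data: bytes,
    section_start: int,
    image_start: int,
    image_end: int,
    min_run: int = 3,
) -> List[Tuple[int, int]]:
    """Return (table_address, entry_count) for each run of at least min_run
    consecutive 8-byte values that fall inside [image_start, image_end)."""
    tables = []
    run_start = 0
    run_len = 0
    for index in range(len(data) // 8):
        (value,) = struct.unpack_from("<Q", data, index * 8)
        if image_start <= value < image_end:
            if run_len == 0:
                run_start = section_start + index * 8
            run_len += 1
            continue
        # The run ended; keep it only if it is long enough to look like a table.
        if run_len >= min_run:
            tables.append((run_start, run_len))
        run_len = 0
    if run_len >= min_run:
        tables.append((run_start, run_len))
    return tables


# Hypothetical section contents: one non-pointer, a three-entry table, then noise.
sample = struct.pack("<6Q", 0, 0x140FFA900, 0x140FFA908, 0x140FFA910, 1, 0x1402145F0)
print(find_pointer_tables(sample, 0x1408E79E8, 0x140000000, 0x141000000))
```

Requiring a run of several in-image values is one possible way to compensate for the lack of relocation entries, at the cost of missing isolated pointers such as the last value in the sample.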

There are a few other ways that might also improve the situation; I am not sure which is best.
I also see this as medium effort, because this case by itself is unlikely to occur and we would most likely bundle the fix with some other refactor to pointer sweep. You could, however, get away with just improving the data variable identification on the function analysis side.