Function referenced data variable improvements (pointer sweep) #7203

@emesare

Description

Binary referenced in ticket: happy star paints carefully.

During analysis we identify pointers to data in functions and create data variables. This step visits every MLIL instruction in a function, and it expects each pointer to appear as a simple expression. That poses a problem in some specific cases where the constant pointer is constructed piecemeal or relative to some other value.

MLIL:

  21 @ 14021499e  rax_2 = [(rcx_3 + &__dos_header) + 0x8e6e60].q
  22 @ 1402149a6  rcx_4 = [(rcx_3 + &__dos_header) + 0x8e79f0].q

MLIL (Show opcodes):

  21 @ 14021499e  (MLIL_SET_VAR.q rax_2 = (MLIL_LOAD.q [(MLIL_ADD.q (MLIL_ADD.q (MLIL_VAR.q rcx_3) + (MLIL_CONST_PTR.q &__dos_header)) + (MLIL_CONST.q 0x8e6e60))].q))
  22 @ 1402149a6  (MLIL_SET_VAR.q rcx_4 = (MLIL_LOAD.q [(MLIL_ADD.q (MLIL_ADD.q (MLIL_VAR.q rcx_3) + (MLIL_CONST_PTR.q &__dos_header)) + (MLIL_CONST.q 0x8e79f0))].q))

HLIL:

14021499e        int64_t rax_2 = (&data_1408e6e60)[rbx]
1402149a6        int64_t rcx_4 = *((rbx << 3) + 0x1408e79f0) // The LHS is folded in from another expr

Your next question might be: why did the data variable referenced at 14021499e (data_1408e6e60) get constructed? It comes down to how pointer sweep operates: the value stored at 1408e6e60 points at a function, which pointer sweep treats as a strong indicator that 1408e6e60 holds a pointer.

1408e6e60  void* data_1408e6e60 = sub_1402145f0
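To make the "strong indicator" idea concrete, here is a minimal sketch in plain Python. It is not Binary Ninja's implementation; `memory`, `function_starts`, and `strong_pointer_slots` are hypothetical names, and the only assumption is that a slot whose stored value equals a known function start is treated as a pointer.

```python
# Hypothetical model: address of an 8-byte slot -> value stored there.
memory = {
    0x1408E6E60: 0x1402145F0,  # value is the start of sub_1402145f0
    0x1408E79F0: 0x140FFA900,  # value points at undiscovered data
}

# Hypothetical set of function start addresses known to the analysis.
function_starts = {0x1402145F0}


def strong_pointer_slots(memory, function_starts):
    """Return slot addresses whose stored value is a known function start."""
    return sorted(addr for addr, value in memory.items()
                  if value in function_starts)


hits = strong_pointer_slots(memory, function_starts)
print([hex(a) for a in hits])
```

Under this model only 1408e6e60 qualifies; 1408e79f0 does not, because nothing is known yet about 0x140ffa900.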

At 0x1408e79f0 we are not so lucky:

Image

And if we manually make a data variable here:

1408e79f0  int64_t data_1408e79f0 = 0x140ffa900 

Our current pointer sweep is conservative: we track these referrers (1408e79f0) and wait until 0x140ffa900 is discovered. If 0x140ffa900 then becomes a data variable, we backtrack and construct data variables at the locations pointing to it, such as 1408e79f0. This means that if 0x140ffa900 is never identified as a data variable during pointer sweep, we will miss 1408e79f0 (assuming no data variable existed there prior to pointer sweep, obviously).
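The deferral described above can be sketched roughly as follows. This is a toy model, not the actual pointer sweep code; the class `DeferredSweep` and its method names are made up for illustration.

```python
from collections import defaultdict


class DeferredSweep:
    """Toy model: a referrer is only promoted once its target is a data variable."""

    def __init__(self):
        self.data_vars = set()
        # target address -> addresses of slots whose value is that target
        self.pending = defaultdict(set)

    def see_candidate(self, referrer, target):
        """Record a slot at `referrer` whose value looks like a pointer to `target`."""
        if target in self.data_vars:
            self.define(referrer)
        else:
            self.pending[target].add(referrer)

    def define(self, address):
        """Create a data variable and backtrack to everything waiting on it."""
        worklist = [address]
        while worklist:
            addr = worklist.pop()
            if addr in self.data_vars:
                continue
            self.data_vars.add(addr)
            worklist.extend(self.pending.pop(addr, ()))


sweep = DeferredSweep()
sweep.see_candidate(0x1408E79F0, 0x140FFA900)
missed = 0x1408E79F0 not in sweep.data_vars   # still waiting on the target
sweep.define(0x140FFA900)                     # target discovered later
found = 0x1408E79F0 in sweep.data_vars
```

If `define(0x140FFA900)` is never called, 1408e79f0 stays in `pending` forever, which is the miss this issue describes.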

So what can we do? There are really two ways to solve this: either identify the data variable during function analysis (likely by simplifying the expression when we check for data variable references), or improve pointer sweep for non-relocatable binaries, likely through some kind of "pointer table" sweep.
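As a rough illustration of the first option, the sketch below folds the constant terms of a nested add so that `(rcx_3 + &__dos_header) + 0x8e79f0` yields the candidate address 0x1408e79f0. It uses a small stand-in expression tree rather than the real MLIL API; the classes `Var`, `Const`, `ConstPtr`, and `Add` and the helper names are hypothetical, and the value 0x140000000 for `&__dos_header` is inferred from the HLIL above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:          # stands in for MLIL_VAR
    name: str


@dataclass(frozen=True)
class Const:        # stands in for MLIL_CONST
    value: int


@dataclass(frozen=True)
class ConstPtr:     # stands in for MLIL_CONST_PTR
    value: int


@dataclass(frozen=True)
class Add:          # stands in for MLIL_ADD
    left: object
    right: object


def split_terms(expr):
    """Return (sum of constant terms, saw a const pointer, non-constant terms)."""
    if isinstance(expr, Add):
        lc, lp, lv = split_terms(expr.left)
        rc, rp, rv = split_terms(expr.right)
        return lc + rc, lp or rp, lv + rv
    if isinstance(expr, ConstPtr):
        return expr.value, True, []
    if isinstance(expr, Const):
        return expr.value, False, []
    return 0, False, [expr]


def candidate_address(expr, mask=(1 << 64) - 1):
    """Fold constants; only report an address if a const pointer took part."""
    total, saw_ptr, _ = split_terms(expr)
    return (total & mask) if saw_ptr else None


# (rcx_3 + &__dos_header) + 0x8e79f0, assuming &__dos_header == 0x140000000
expr = Add(Add(Var("rcx_3"), ConstPtr(0x140000000)), Const(0x8E79F0))
addr = candidate_address(expr)
print(hex(addr))
```

Requiring a `ConstPtr` somewhere in the sum is a design choice in this sketch: it keeps plain integer arithmetic such as `rcx_3 + 8` from producing bogus data variable candidates.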

There are a few other approaches that might also improve the situation; I'm not sure which is best.

I also see this as medium effort, because this change is unlikely to happen by itself and we would most likely bundle it with some other refactor of pointer sweep. But you could get away with just improving the data variable identification on the function analysis side.

Metadata

    Labels

    Component: Core (Issue needs changes to the core)
    Core: MLIL (Issue involves Medium Level IL)
    Effort: Low (Issue should take < 1 week)
    Impact: Medium (Issue is impactful with a bad, or no, workaround)
