Arnau Profile picture
22 Apr, 25 tweets, 6 min read
Ok, I had some time to read the new paper on MBA deobfuscation: usenix.org/conference/use…

TL;DR Cool stuff and great contribution, but tarnished by some omissions that make it seem to have a bigger impact and general applicability than it really has, imho.

A thread 🧵
First things first. I don't consider myself an expert, but I'm quite familiar with academic literature regarding MBA (de)obfuscation. This paper constitutes great research, provides a novel proof and offers promising results.
That being said, I think it is necessary to point out some omissions, limitations and/or formal inaccuracies that might not be obvious if one is not familiar with MBA literature. I'll use excerpts from my thesis () and literature to clarify some ideas.
The definition that the paper provides for an MBA expression is what literature defines as a linear MBA expressions, excluding the more general definition usually referred to as polynomial MBA expression. Definitions of Polynomial M...
Thus, from the very beginning we must know that the paper will deal with a subset of MBA expressions, and this fact is not mentioned anywhere*.

*it can be vaguely deduced from a sentence at the very end of the paper, which we will see later.
Next, it appears that it is only addressing obfuscated MBA expressions that have been constructed through MBA rewriting rules, but not through insertion of (usually affine) identities. Mechanisms for obfuscation ...
It is not entirely clear to me that their deobfuscation method would apply directly to obfuscated expressions that involve some insertion of (affine) identities, but there are two indicators that this won't be the case or, at least, does not appear it has been tested.
First, the fact that when addressing Tigress obfuscation in 7.4, the EncodeData method is not included in the testbed (while it is used both in Syntia and QSynth testbeds afaik).
Second, this vague mention at the end of Discussion section: "It is also possible that attackers combine MBA obfuscation with other data encoding techniques to create complex expressions with bitwise and arithmetic operations but does not meet the MBA definition in this paper"
But what are MBA expressions if not "expressions with bitwise and arithmetic operations"?
Essentially, it is saying that what has been defined as an MBA expression is indeed a restricted subset, and that some MBA obfuscation transformations would not be covered by their approach.
Let's get to the core contribution of the paper: a way to reduce n-bit expressions to 1-bit expressions. This is an application that comes from proving the converse part of a known theorem (see section 2.2.3 from my thesis for a more detailed version and proof of sufficiency). Theorem as presented in the...
Essentially this theorem that allows us to derive linear MBA rewriting rules on N-bits starting from a {0,1}-matrix. By proving the converse part of the theorem, the authors of the paper prove that we can actually go the other way: form n-bit to 1-bit expressions (and back). Corollary of previous theor...
And this is really cool. But, based on this, they state that "The two-way feature in current MBA obfuscation implies that any n-bit obfuscated MBA expression can be simplified in 1-bit space. Consequently, the MBA reduction in 1-bit space is equivalent to that in n-bit space".
However, this affirmation is not correct, or at least not entirely precise imho, even restricting our analysis to linear MBA expressions: the underlying theorem provides the mechanism to work around "some" linear MBA expressions, but not necessarily all of them.
Indeed, Eyrolle's thesis already points out that for the theorem to hold, the bitwise operations cannot contain constants other than 0 and -1. Only valid constants in bit...
While not explained further, the reasoning behind it is pretty simple: the only way to map an n-bit constant to a 1-bit constant in a two-way manner is if the n-bit constant is either all zeroes (0) or all ones (-1).
Another thing that surprised me is no reference at all to QSynth paper from @quarkslab folks, which already introduced a kind of offline truth-table approach, in that case for a synthesis-based approach.
Moreover, the paper mentions that "The common limitation of existing deobfuscation efforts is that they treat MBA obfuscation as a black box, rather than investigate the mechanism under the hood".
Which is mostly true, except that QSynth paper already presented an algorithm for sub-AST (sub-expression) reasoning and simplification. Ok, QSynth implementation hasn't been made public afaik, but still.
Finally, a brief comment on the evaluation and results. So, why does it work so well? The first dataset seems to include basic "bit hacks" and MBA expressions from Eyrolles' thesis appendices, which are essentially rewriting rules derived using the aforementioned theorem.
The second dataset appears to be constructed by the authors using the rewriting rules approach that they described, which constitutes the MBA obfuscation approach addressed from a deobfuscation point of view in the paper.
(Take this with a grain of salt for the moment. Waiting for the datasets to be published, but that's my understanding from what is mentioned in the paper).
For the more real-world examples, my guess (and my general observation in the wild) is that *everyone* is using the *same* kind of (linear) MBA rewriting rules for MBA obfuscation because these are the ones easily available publicly and known to be easy to generate.
Don't get me wrong, I'm not trying to undervalue their research achievements, real-world applicability and performance, which is undeniable, but to provide a greater context and background so we can all appreciate and position better the contributions made.
To sum up: great research and promising practical approach to deobfuscate a (large, and even "mainstream") subset of linear MBA expressions. I missed a better / more complete contextualization and clear applicability scope delimitation.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Arnau

Arnau Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @arnaugamez

1 Dec 20
🔥 #AdventOfReversing 1/24 🔥
Get dirty as soon as possible. Don't fall into thinking you are not ready. Sure, you will be confused by many things at first. That's fine! I used to confuse sections and segments when I started. Keep pushing, and things will become clear naturally.
🔥 #AdventOfReversing 2/24 🔥
Get used to (re)name *everything* in your disassembler. You might be able to mentally track data across registers and memory for small crackmes w/ easy control flow, but this does not scale at all. Unclutter your mind. Make your life easier.
🔥 #AdventOfReversing 3/24 🔥
You really want to have some programming foundations, but which languages? I mostly agree with this post by @MalwareTechBlog:

🐍 Python
🏗️ C
⚙️ ASM (different flavors: x86(-64) desktop, ARM mobile...)

Give it a read! 📰
malwaretech.com/2018/03/best-p…
Read 19 tweets
24 Jul 20
📢Just published my Maths & CS BSc thesis:
📄"Code deobfuscation by program synthesis-aided simplification of Mixed Boolean-Arithmetic expressions".

Serves as an intro/review to:
- Code (de)obfuscation
- MBA expressions
- Program synthesis

👉github.com/arnaugamez/tfg Cover of the thesis documentAbstract of the thesis
First @radareorg r2syntia prototype was developed in the context of this work. Now it has already been slightly improved and it is being updated and maintained here, in case you are interested: github.com/arnaugamez/r2s…
There is nothing really new (a part form r2syntia) but I would be glad if this "synthesis" work (pun intended) serves as a first step into the field of code (de)obfuscation and motivates the study of MBA expressions and program synthesis to help in further low-level research.
Read 5 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!

Follow Us on Twitter!

:(