How do people reverse-engineer compiled applications to get the source code?


I know the long answer to this question would probably be the equivalent of a college course, but can you summarise how tech people do this?

If you open game.exe with a text editor you’re just going to get what looks like a scrambled mess of characters, so how would one convert this into readable source code?


12 Answers

Anonymous 0 Comments

Some things to consider:

1. What you see in the exe file is machine code (assembly is just the human-readable notation for it). These are direct commands to the CPU to do things like add 2 numbers together or send some piece of information to your graphics card. It’s not meant to be read by humans, it’s read by the computer. This is why the exe file looks like a mangled mess of random characters when you open it: the bytes are instructions and data, not text encoded as ASCII or UTF-8, the typical encodings for human-readable text. You need a tool called a disassembler to translate those bytes into assembly, and those are easy enough to find. You can also do it manually with a hex editor… But only the insane do that (Yes I consider assembly experts insane, they tend to be a special, albeit brilliant, type.)

2. Decompiling is the term here and this is REALLY complicated. This is because the way humans think and write software is completely different from how computers work. Modern compilers, the software responsible for converting human-readable code to machine code, have a huge number of rules for optimizing the final binary so that it performs really well on the hardware. Things that only matter to humans, like comments and most variable names, usually get thrown away along the way. This process tends to mangle the original source code to the point where the compiled binary bears very little resemblance to it. Compilers are often compared to human translation, like between English and Japanese. In reality it’s more like translating from English to dog, where the dog probably doesn’t even have a concept of the sun being a constant giant thermonuclear explosion.
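To make point 1 concrete, here’s a rough Python sketch. The eleven bytes are x86-64 machine code for a tiny function that just returns 42 (my own example, not pulled from any real game.exe), and the helper names are made up for illustration. It shows the same bytes the way a text editor would roughly treat them versus the way a hex editor does; the mnemonics in the comments are what a disassembler would give you.

```python
# Eleven bytes of x86-64 machine code for a function that returns 42.
# The comments show the assembly a disassembler would print for each group.
MACHINE_CODE = bytes([
    0x55,                          # push rbp
    0x48, 0x89, 0xE5,              # mov  rbp, rsp
    0xB8, 0x2A, 0x00, 0x00, 0x00,  # mov  eax, 42
    0x5D,                          # pop  rbp
    0xC3,                          # ret
])


def as_text_editor(data: bytes) -> str:
    """Roughly what a text editor does: treat every byte as one character."""
    return data.decode("latin-1")


def as_hex_editor(data: bytes) -> str:
    """What a hex editor shows: every byte as two hex digits."""
    return data.hex(" ")


# The "scrambled mess": a few printable characters mixed with control bytes.
print(repr(as_text_editor(MACHINE_CODE)))

# The hex view of exactly the same bytes.
print(as_hex_editor(MACHINE_CODE))  # 55 48 89 e5 b8 2a 00 00 00 5d c3
```

Nothing is actually scrambled. The text editor is just guessing that the bytes are characters when they’re really instructions.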

To give you an idea of how much the compiler could change the source code, let’s say you have 2 arrays of 2 values each. You write the code to iterate through all values one at a time, but you go through the first value in each array then the second and so on:

> [ a1 a2 ]
> [ b1 b2 ]

Your code goes in this sequence: a1 b1 a2 b2

Most CPUs are optimized to be REALLY good at going through arrays like this: a1 a2 b1 b2

Going a1 b1 a2 b2 might be significantly slower because it can force the CPU to miss its memory cache and fetch from RAM instead of from memory on the chip. With 2 values per array you’d never notice, but with big arrays it adds up: a single trip to RAM can cost on the order of a hundred times more than a cache hit. The compiler may see this and, if it can prove the result stays the same, rearrange things to go a1 a2 b1 b2 without you as the developer knowing. This is a REALLY simple example; some of the rules get far more complicated, which makes decompiling a really tricky task without additional information.
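Here’s a small Python sketch of the two visiting orders, just to show why a compiler is allowed to swap one for the other: the answer comes out the same. Python itself doesn’t do this reordering, and the function names and values are invented for the example.

```python
A = [("a1", 1), ("a2", 2)]
B = [("b1", 10), ("b2", 20)]


def sum_interleaved(a, b):
    """Order as written in the source: a1 b1 a2 b2."""
    order, total = [], 0
    for (name_a, val_a), (name_b, val_b) in zip(a, b):
        order += [name_a, name_b]
        total += val_a + val_b
    return order, total


def sum_sequential(a, b):
    """Order an optimizer might prefer: a1 a2 b1 b2."""
    order, total = [], 0
    for name, val in a + b:
        order.append(name)
        total += val
    return order, total


print(sum_interleaved(A, B))  # (['a1', 'b1', 'a2', 'b2'], 33)
print(sum_sequential(A, B))   # (['a1', 'a2', 'b1', 'b2'], 33)
```

If all you have is the second version, there’s no way to tell that the programmer originally wrote the first one. That information is gone by the time you’re looking at the binary.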

And this doesn’t even go into how hard it can be to tell apart concepts like if statements, for loops, and inlined functions from a language like C/C++ in the final assembly. Ifs and loops both tend to turn into a bunch of jump instructions (basically GOTO statements) that tell the CPU to continue executing code at a certain location in the EXE, and an inlined function just becomes more instructions pasted into the caller. Decompilers are generally good at picking these patterns out but they’re far from foolproof, especially when you throw in some of the crazier optimizations compilers do.
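To show what “everything turns into jumps” looks like, here’s a toy sketch in Python. The instruction set (`set`, `add`, `add_const`, `jump`, `jump_if_ge`, `halt`) is something I made up for this example and is far simpler than any real CPU, but the idea carries over: a for loop and an if/else are built from the same compare-and-jump pieces.

```python
def run(program):
    """Execute a list of toy instructions and return the final registers."""
    regs = {}
    pc = 0  # program counter: index of the next instruction
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "set":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] += regs[args[1]]
        elif op == "add_const":
            regs[args[0]] += args[1]
        elif op == "jump":
            pc = args[0]
        elif op == "jump_if_ge":
            if regs[args[0]] >= args[1]:
                pc = args[2]
        elif op == "halt":
            return regs


# total = 0; for i in range(5): total += i
FOR_LOOP = [
    ("set", "total", 0),        # 0
    ("set", "i", 0),            # 1
    ("jump_if_ge", "i", 5, 6),  # 2: leave the loop once i >= 5
    ("add", "total", "i"),      # 3
    ("add_const", "i", 1),      # 4
    ("jump", 2),                # 5: back to the loop condition
    ("halt",),                  # 6
]

# x = 7; y = 1 if x >= 5 else 0
IF_ELSE = [
    ("set", "x", 7),            # 0
    ("jump_if_ge", "x", 5, 4),  # 1: skip the else branch when x >= 5
    ("set", "y", 0),            # 2: else branch
    ("jump", 5),                # 3
    ("set", "y", 1),            # 4: then branch
    ("halt",),                  # 5
]

print(run(FOR_LOOP))  # {'total': 10, 'i': 5}
print(run(IF_ELSE))   # {'x': 7, 'y': 1}
```

A decompiler only sees the instruction lists. It has to guess that a backwards `jump` probably means a loop and a forwards one probably means an if, and optimizations can blur even that.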
