How do people reverse-engineer compiled applications to get the source code?

I know the long answer to this question would probably be the equivalent of a college course, but can you summarise how tech people do this?

If you open game.exe with a text editor you’re just going to get what looks like a scrambled mess of characters, so how would one convert this into readable source code?

It’s not “a scrambled mess of characters”, it’s the specific machine code of the computer processor it will run on. That’s known by the reverse-engineer, and they certainly have existing tools that can translate that “Windows on Intel” executable into assembly code. Then it’s simply a matter of pattern matching the assembly code to higher level language constructs. Most reverse engineers read assembly code pretty well, so the goal of mapping 90% to C source, for example, is just to make the code take fewer pages to print out.

You get a scrambled mess of characters because the text editor tries to decode the contents of the file as text, which it isn’t. It’s like trying to read Chinese when you don’t know the language.
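To make that concrete, here is a small Python sketch. The five bytes are made up for illustration (they are not taken from any real game.exe). It shows the same bytes once forced into characters, the way a text editor would, and once as a hex dump:

```python
# A few example bytes standing in for the start of a compiled program
# (chosen for illustration only).
machine_code = bytes([0x55, 0x89, 0xE5, 0x90, 0xC3])

# Roughly what a text editor does: treat every byte as one character.
as_text = machine_code.decode("latin-1")

# A hex view shows the same bytes without pretending they are text.
as_hex = machine_code.hex(" ")

print(repr(as_text))  # mostly unprintable or odd-looking characters
print(as_hex)
```

Nothing is lost or scrambled in the file itself; only the interpretation as text is wrong.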

The contents of the file are (among other things) instructions written in machine code. You can view the file using a program called a “disassembler”, which converts these into assembly code. Assembly is basically a human readable equivalent of machine code – instead of a bunch of 0s and 1s, you get commands like ADD, MUL and JMP.

You have tools that convert assembly code into readable code, but the complicated part is that every variable name, function name and everything else that helps a human understand code is lost.
Imagine if someone only gives you step-by-step mathematical instructions to calculate pi. If you don’t know that this specific part is made to calculate pi, and you only have the basic mathematical operations to figure out what it does, it would be very complicated for you.
You don’t necessarily need to read assembly code yourself, but the result is still a puzzle, only with variables and functions as the pieces.

What you get if you open it with Notepad is just the ASCII representation of the bytes of machine code. In the same way that opening a picture file with Notepad will just look like mangled garbage, so will an application.

However, there is still a pattern to that machine code. If you write and compile the same code, you get the same machine code out of it. It is possible to do this process backwards (decompiling). However, there are some things that don’t get preserved.

For example, if I write code that says something like `var numberOfRetries = 5`, that *numberOfRetries* name for my variable isn’t important in the final product, so it gets discarded during compilation (and the compiler just knows to use the actual memory addresses/etc instead). If you run that code through a decompiler, you would just get something like `var a = 5`, and you would have to, through context, figure out what `a` actually does.

So it becomes a puzzle of figuring out what everything means. You have to use contextual clues to figure things out. Sometimes this is relatively easy: for example, if you see `log.debug("Retries left: {}", a)`, you might realize that “*a*” represents the number of retries you have left. You build on this knowledge, rebuild all those variable names, and gradually figure out what’s happening.
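Text like that log message usually survives compilation as plain bytes, so one common first step is to pull every readable string out of the binary. A minimal sketch in Python, where the `blob` contents and the function name `extract_strings` are made up for illustration:

```python
import re

def extract_strings(blob: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len characters long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, blob)]

# Hypothetical slice of a binary: machine code bytes around a message.
blob = b"\x00\x90\x05Retries left: %d\x00\xff\x01ab\x00"

print(extract_strings(blob))
```

Any code that refers to the address of that string is then a good candidate for "the part that deals with retries".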

That mess of characters has meaning, as evidenced by the computer knowing how to run it. Converting it back is just a matter of converting the coded data back to a human-readable form. This is no different than how you can convert sounds to letters and understand them either way. The jumbled mess of characters is no different than how this block of text is a jumbled mess of characters to somebody that cannot read English, or Japanese text is a jumbled mess for somebody that cannot read Japanese.

The computer prefers binary instructions like “0x90” (1 byte when stored in a file), while a human will have an easier time reading it as “NOP”, which means “no operation” or “do nothing”, typically implemented as a meaningless operation like X+0. All of the other operations a computer can perform are similarly encoded: for example, in 16-bit code “0x05 0x00 0x10” is “ADD AX, 0x1000” or “x = x + 4096” (x86 stores multi-byte numbers low byte first, so the bytes 0x00 0x10 read as 0x1000 rather than 0x0010). For the vast majority of processors, you can look up the encodings online as they are publicly provided by the chip maker; the examples I gave are for x86/x64 (desktop Intel/AMD processors).
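As a rough sketch of what a disassembler does with those bytes, here is a toy decoder in Python that only knows the two opcodes mentioned above. It assumes 16-bit operand size and x86's little-endian byte order, under which the three bytes decode to ADD AX, 0x1000 (the "x + 4096" reading). The function name `decode_16bit` is made up, and a real disassembler handles hundreds of encodings:

```python
def decode_16bit(code: bytes) -> list[str]:
    """Decode a byte string that uses only NOP (0x90) and ADD AX, imm16 (0x05)."""
    out = []
    i = 0
    while i < len(code):
        opcode = code[i]
        if opcode == 0x90:
            out.append("NOP")
            i += 1
        elif opcode == 0x05:
            # The two bytes after the opcode are the number, low byte first.
            imm = int.from_bytes(code[i + 1:i + 3], "little")
            out.append(f"ADD AX, 0x{imm:04X}")
            i += 3
        else:
            # Anything this toy does not know is shown as a raw data byte.
            out.append(f"DB 0x{opcode:02X}")
            i += 1
    return out

print(decode_16bit(bytes([0x90, 0x05, 0x00, 0x10])))
```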

That very basic form is known as disassembly, and produces readable code that is very verbose.

From there, you can process the instructions for common patterns to make even more readable code. The process is not perfect, however. The computer has no need to know the names of variables or functions, and so those are almost never saved in the binary. To get those, you either have to figure them out from what the code does or when it is called, or you need separate debugging data. Optimizations will also make a mess of the code by moving and changing it, sometimes very significantly, which makes the whole process much harder.

As a final note, some code will not be fully compiled. Languages like JavaScript (heavily used on webpages) are distributed as plain text that is only compiled right before it is run. Python (commonly used for desktop and scientific scripting) can be compiled ahead of time, but keeps a lot of debugging information with the code for ease; often it is distributed as plain text as well.
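You can see the Python case for yourself with the standard `dis` module. In this small sketch (the function `retry_budget` is just an example), the compiled code object still carries the local variable name, which is exactly the kind of information a native compiler would normally throw away:

```python
import dis

def retry_budget():
    number_of_retries = 5
    return number_of_retries

# The compiled code object still remembers what the local variable was called.
local_names = retry_budget.__code__.co_varnames
print(local_names)

# Print the bytecode instructions that Python actually runs.
dis.dis(retry_budget)
```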

Computers “think” in binary: energy = 1, no energy = 0. The structure of the chip makes those states produce certain results. You can compare it to a huge system of floodgates that either let water through or not. In reality it’s a little more complicated. Certain results have been given names to make them understandable for humans. This is called assembler or assembly. Everything that happens in your computer happens at this level.

It would be really complicated to write your code in this language (I have done it myself), so some clever people used those assembly commands to create higher-level languages that are easier to understand and are oriented more toward how humans think. There is one disadvantage: those languages can be slower, so if you want the code to run faster, you translate it into assembly and optimize it by deleting unnecessary code.

Now, to reverse-engineer code, you have to act like the scientists in Jurassic Park. You take parts of the assembly code and compare them to the assembly code that a higher-level language produces. If you find enough elements of the “assembly DNA” of a certain command, you fill in the gaps with the necessary assembly code and ta-da, you’ve got another line. Because this has to follow certain rules, you can write a program that does it automatically.

That scrambled mess is actually code to tell the computer to do stuff. And while you can’t infer the intentions of the code, you can know what it is doing in what order. For example this is what compiling might look like

**Code: (Assign the number 1 to an integer named playerId)**

    int playerId = 1;

That gets compiled to assembly language which is a human labelled version of the base machine code. And it might look like this:

**Assembly: (Assign the number 1 to an integer named R0)**

    IMM R0, 0x80
    LOAD R0, R0
    IMM R1, 0x1
    STORE R0, R1

That is then converted into binary to be stored in a binary file like an exe file. And that might look like this.

**Machine Code:**

    0x 60 00 00 80
    0x A4 00 00 00
    0x 60 01 00 01
    0x 08 00 00 01

Or as you would see in the file something like this:

    01100000000000000000000010000000
    10100100000000000000000000000000
    01100000000000010000000000000001
    00001000000000000000000000000001

Or:

>01100000000000000000000010000000101001000000000000000000000000000110000000000001000000000000000100001000000000000000000000000001

Reverse engineering code is just doing that process in reverse. Yes, we no longer know that R0 was playerId, but we don’t really care, and we can infer that if we look hard enough.
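Going backwards over the example above can be sketched in a few lines of Python. The instruction layout here is an assumption read off the hex dump (first byte is the opcode, second byte is the register, last two bytes are either a number or a second register); the made-up instruction set in this answer does not define it anywhere else:

```python
# Opcode values inferred from the example hex dump (an assumption).
OPCODES = {0x60: "IMM", 0xA4: "LOAD", 0x08: "STORE"}

def disassemble(blob: bytes) -> list[str]:
    """Turn 4-byte instruction words back into assembly text."""
    lines = []
    for offset in range(0, len(blob), 4):
        opcode, reg, high, low = blob[offset:offset + 4]
        operand = (high << 8) | low
        name = OPCODES[opcode]
        if name == "IMM":
            # IMM loads a literal number into a register.
            lines.append(f"IMM R{reg}, {hex(operand)}")
        else:
            # LOAD and STORE take a second register as their operand.
            lines.append(f"{name} R{reg}, R{operand}")
    return lines

machine_code = bytes.fromhex("60000080 A4000000 60010001 08000001")
for line in disassemble(machine_code):
    print(line)
```

This recovers the four assembly lines exactly, but nothing in the bytes says that the value being stored was ever called playerId.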

A comparison answer: Computers need very specific instructions, which they hold as machine code (written out for humans as assembly). If these were driving instructions to drive up to a turn and then make it, they would be something like:

* Accelerate in a straight line until 15 m/s has been achieved.
* Maintain velocity until 100 m before turn location.
* Begin decelerating to safe turning speed and maintain until at turn.
* Begin turning to the right in an arc that is 5 m in radius until 90° has been achieved.
* Accelerate until 15 m/s has been achieved.

A real driving program would need a bunch more details, but you should get the picture.

Looking at what is happening above, a reverse engineer could make assumptions that the above list is just the instruction “Drive at the speed limit and then take a right turn”.

In computer programs, it is the same principle but applied to programming languages instead. The assembly instructions all come from a limited set of instructions that can be looked up in a list, so looking at the assembly (the apparently nonsense characters) allows the engineer to know all of the tiny instructions that the program is doing, and then “reverse” the tiny instructions into bigger instructions that make it easier for a human to get the bigger picture of what the program is doing.

Aside from disassembling the code, there is another neat trick. You can wrap a program in a way that lets you intercept and monitor calls out of the program and into system and custom libraries. Oftentimes that gives you the essential parts of the code without needing to disassemble and understand the code.
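Here is the idea in miniature, as a Python sketch. Everything in it is hypothetical: `library` stands in for a system or third-party library, and `program` for the black box you are studying. Real tools do the same thing at the level of operating system or DLL calls:

```python
from types import SimpleNamespace

# Stand-in for a system or third-party library (hypothetical).
library = SimpleNamespace(
    save_score=lambda player, score: f"saved {player}={score}"
)

call_log = []

def trace(name, func):
    """Wrap func so every call and its result is recorded in call_log."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        call_log.append((name, args, kwargs, result))
        return result
    return wrapper

# Swap the real function for the wrapped one before the program runs.
library.save_score = trace("save_score", library.save_score)

def program():
    # The black box: we never read this code, we only watch what it calls.
    library.save_score("player1", 4200)

program()
print(call_log)
```

The log tells you what the program asked for and with which values, without you ever looking at its instructions.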

Some things to consider:

1. What you see in the exe file is machine code, the binary form of assembly. These are direct commands to the CPU to do things like add 2 numbers together or send some piece of information to your graphics card. It’s not meant to be read by humans, it’s read by the computer. This is why when you open the exe file it’s a mangled mess of random characters. It’s not encoded as ASCII or UTF, the typical standard encoding formats for human readable text. You need a tool that can do that translation for you, and such tools are easy enough to find. You can also do it manually with a hex editor… But only the insane do that (Yes I consider assembly experts insane, they tend to be a special, albeit brilliant, type.)

2. Decompiling is the term here and this is REALLY complicated. This is because the way humans think and write software is completely different from how computers work. Modern compilers, the software responsible for converting human-readable code to assembly, have literally thousands of rules that will optimize the final binary so that it performs really well on the hardware. This process tends to mangle the original source code to a point where the compiled binary bears very little resemblance to it. Compilers are often compared to human translation like between English and Japanese. In reality it’s more like translation from English to dog where the dog probably doesn’t even have a concept of the sun being a constant giant thermonuclear explosion.

To give you an idea of how much the compiler could change the source code, let’s say you have 2 arrays of 2 values each. You write the code to iterate through all values one at a time, but you go through the first value in each array then the second and so on:

> [ a1 a2 ]
> [ b1 b2 ]

Your code goes in this sequence: a1 b1 a2 b2

Most CPUs are optimized to be REALLY good at going through the array like this: a1 a2 b1 b2

Going a1 b1 a2 b2 might be significantly slower because it might force the CPU to miss memory cache and fetch from RAM instead of memory on the chip. We’re talking about hundreds of times slower. The compiler will see this and will do something to make it go a1 a2 b1 b2 without you as the developer knowing. This is a REALLY simple example, some of the rules can get really complicated which makes decompiling a really tricky task without additional information.
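The two visiting orders can be written out as a tiny Python sketch (the names `as_written` and `reordered` are made up for illustration). Both loops touch exactly the same values, which is why a compiler could in principle swap one for the other, and why a decompiler cannot tell which one the developer originally wrote:

```python
a = ["a1", "a2"]
b = ["b1", "b2"]

# Order as written by the developer: first value of each array, then the second.
as_written = [array[i] for i in range(2) for array in (a, b)]

# Cache-friendly order: finish one array before starting the next.
reordered = [value for array in (a, b) for value in array]

print(as_written)
print(reordered)
```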

And this doesn’t even go into how it can be really hard to tell the difference between concepts like if statements, for loops, and inline functions in a language like C/C++ in the final assembly. All of these tend to turn into a bunch of GOTO statements in assembly where it tells the CPU to execute code in a certain location in the EXE. Software based decompilers are generally good about picking these out but they’re far from foolproof, especially when you throw in some of the crazier optimizations compilers do.
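You can watch a loop dissolve into jumps without leaving Python. This is only an analogy, since it shows Python bytecode rather than x86 machine code, and the exact instruction names vary between Python versions. The function `count_down` is just an example:

```python
import dis

def count_down(n):
    while n > 0:
        n -= 1
    return n

# Collect every bytecode instruction whose name mentions a jump.
jumps = [
    instruction.opname
    for instruction in dis.get_instructions(count_down)
    if "JUMP" in instruction.opname
]
print(jumps)
```

There is no "while" anywhere in the compiled form, only conditional and unconditional jumps that a decompiler has to recognize as a loop.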

Lots of answers here already, but to extend on them, here is some context for what you may have seen in some videos online.

Not every “.exe” is machine code.

Many development platforms use a special kind of in-between layer that does actually make some level of “unscrambling” possible (e.g. Unity, Flash, Java, Android).

Everything people have said above/below is generally right (TheLuminary had a particularly good explanation [here](https://www.reddit.com/r/explainlikeimfive/comments/vbhi3e/eli5_how_do_people_reverseengineer_compiled/ic8a77a/)), with two additional details.

Sometimes it’s not the computer reading these instructions, but instead another program, and sometimes it’s not just the .exe file that has this information, but also some related files (or symbols) which might have an extension like .pdb.

When you see people “reverse engineer” some program and get perfectly readable code, chances are they’re reversing a program written on one of these platforms which either doesn’t compile things down to 1s and 0s, or they have the “symbols” which are basically a big list of variable names and the address at which those names are used.

To extend on /u/TheLuminary’s answer, we can pretty much always get back to something like:

    IMM R0, 0x80
    LOAD R0, R0
    IMM R1, 0x1
    STORE R0, R1

Those symbol files however give us the information that says:

> `R0` on line 1 is named `playerId`

Without that, when we come back the other way, we end up with:

    int int_0 = 1;

but *with* the symbol data, we can get:

    int playerId = 1;

There are also software packages out there, e.g. IDA Pro, that get you to the `int_0` stage, and then allow you to go through and rename things to make the code readable again if you don’t have the symbols. Generally they do this by effectively writing their own symbols database that you get to populate by hand 🤣
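That renaming step is conceptually very simple. Here is a minimal Python sketch, assuming the symbols are just a dictionary from placeholder names to real names; the function `apply_symbols` is made up and is not how IDA Pro or .pdb files work internally:

```python
import re

def apply_symbols(code: str, symbols: dict[str, str]) -> str:
    """Replace placeholder identifiers with names from a symbol table."""
    def replace(match):
        name = match.group(0)
        return symbols.get(name, name)
    # Match whole identifiers only, so the type "int" is not touched by "int_0".
    return re.sub(r"\b[A-Za-z_]\w*\b", replace, code)

decompiled = "int int_0 = 1;"
symbols = {"int_0": "playerId"}

print(apply_symbols(decompiled, symbols))
```

Whether the dictionary comes from a symbol file or from you typing names in by hand, the decompiled code is the same; only the labels change.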

[This is a really really long Twitter thread](https://mobile.twitter.com/Foone/status/1536053690368348160) – that’s just how foone likes to write.

This is a step-by-step live walkthrough of decompiling and reverse-engineering the old game SkiFree. They start by using a few programs to take the exe and turn it back into code – completely unreadable code, but still code. The decompiler can take the machine code from the exe and generate C code that matches up with it, but none of the functions or variables are going to have useful names. The decompiler then gives you tools to start organizing and renaming the decompiled code until it’s clear.
[Here](https://twitter.com/Foone/status/1536061110662533125?s=20&t=l99fGoot0_XjoXn0Xn9f9A) they find a function that the decompiler just called “FUN_00404950”. They look at it and see that it takes a piece of text as an input, does some checks on it, then tells Windows to display a message box with that text. So they change the name of this function to “DisplayMessageBox”.

Now, they start looking around for parts of the program that call DisplayMessageBox – can we figure out what _those_ are doing? Frequently, you look at the way things are formatted or the exact bits of text used – [here](https://twitter.com/Foone/status/1536067991518912512?s=20&t=l99fGoot0_XjoXn0Xn9f9A) is some code that makes a bit of text that looks like “number:number:number.number”. If you look at the game, you notice that the player’s time is written that way: “hours:minutes:seconds.fraction”. So this function that generates that text is probably displaying the player time.

Two functions down, a few hundred to go…