It’s been on my mind: if we are using the software or even the hardware of a tech company, we can play around with it, install and uninstall it, and more. So why is it so difficult for someone to “unhide” the source code that the device uses? Technically the code is hidden somewhere in the device, so it’s there, but it’s still almost impossible to obtain the source code. How do they achieve this so that no one copies their code?
Let me try to get more “for a 5 y.o.” by trying to explain how a processor works and how we (historically) tried to make programming easier (but also make decompiling harder).
You have probably heard that computers “only use 0 and 1”, even though you clearly interact with e.g. your smartphone using text and pictures and other abstract concepts. So this is where we go back to the 0s and 1s… 🙂
A command that your computer’s CPU (or “processor”) can *actually* understand could be represented e.g. as `10110101`. This means that it should set the electricity to **on** for the first transistor, to **off** for the second, then the next two are **on** again, etc. This is what we call an “opcode” (short for “operation code”).
Of course, the processor is immensely complex, and the commands are also more complex nowadays, but that is basically how a processor command works.
To make their lives easier, programmers first started to use shorter hexadecimal numbers instead of the long binary numbers above (in this case: `B5`). Then they gave them more meaningful names (e.g. `ADD` or `JMP`) and called this “assembly language” (or “Assembler”), but the principle is still the same: you need a series of numbers that represent the transistor states that the processor should take.
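To make the binary-to-hexadecimal step concrete, here is a minimal C++ sketch. The function name `binaryToHex` is made up for this illustration; it simply groups the bits in fours and maps each group to one hex digit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Converts a string of '0'/'1' characters into uppercase hexadecimal.
// Assumption for this sketch: the length is a multiple of 4.
std::string binaryToHex(const std::string& bits) {
    assert(bits.size() % 4 == 0);
    const char digits[] = "0123456789ABCDEF";
    std::string hex;
    for (std::size_t i = 0; i < bits.size(); i += 4) {
        int value = 0;
        // Each group of four bits becomes exactly one hex digit.
        for (std::size_t j = 0; j < 4; ++j) {
            value = value * 2 + (bits[i + j] - '0');
        }
        hex += digits[value];
    }
    return hex;
}
```

Calling `binaryToHex("10110101")` gives `B5`, the same shorthand used above: `1011` is `B` and `0101` is `5`.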
The next step would be to find a way to write something more human-readable and convert (= “compile”) this into such opcodes. The early compiled languages did just that. They let you write something like `let xpos = ypos + 3` and converted this into a set of processor opcodes that could be transcribed as:
> take the value from the memory cell whose address is stored in the variable `ypos`, add a fixed value of 3 to it and then write the result into the memory cell whose address is stored in variable `xpos`.
That is a lot of different opcodes to write, and things get more complicated once you use loops, if-then statements and subroutines.
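As a rough sketch of what those three steps look like to a machine, here is a toy processor in C++. Everything in it is invented for illustration: the three instructions (`Load`, `AddConst`, `Store`), the four-cell memory, and the assumption that `ypos` lives at address 0 and `xpos` at address 1. A real CPU has far more instructions, but the idea is the same.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction set with a single accumulator register.
enum class Op { Load, AddConst, Store };

struct Instr {
    Op op;
    int operand;  // memory address for Load/Store, constant for AddConst
};

using Memory = std::array<int, 4>;

// Executes the program on a copy of the memory and returns the result.
Memory run(const std::vector<Instr>& program, Memory memory) {
    int acc = 0;
    for (const Instr& instr : program) {
        if (instr.op == Op::AddConst) {
            acc += instr.operand;
            continue;
        }
        assert(instr.operand >= 0 &&
               static_cast<std::size_t>(instr.operand) < memory.size());
        const std::size_t address = static_cast<std::size_t>(instr.operand);
        if (instr.op == Op::Load) {
            acc = memory[address];   // read a memory cell
        } else {
            memory[address] = acc;   // write a memory cell
        }
    }
    return memory;
}

// "let xpos = ypos + 3", assuming ypos is at address 0 and xpos at address 1.
std::vector<Instr> compiledAssignment() {
    return {{Op::Load, 0}, {Op::AddConst, 3}, {Op::Store, 1}};
}
```

If you start with memory `{10, 0, 0, 0}` and call `run(compiledAssignment(), memory)`, cell 1 ends up holding 13. Note that the program itself contains only the addresses 0 and 1; the names `xpos` and `ypos` appear nowhere in it.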
And when you look at “higher” programming languages (think C++) all the abstraction and encapsulation makes things even more complicated.
Now, going from the machine code (the zeros and ones) to the short mnemonic codes of Assembler is relatively easy. They are basically the same, really. But the further you go from there the more complicated it gets. There might be many reasons why a disassembler finds the opcode to load something from a memory cell, and there can be many reasons to add a fixed value. At some point, the disassembler will have to guess what is the most likely original source code.
But most importantly, the machine code does not contain any variable or function names, comments or other information that was present in the high-level source code. A decompiler will just invent variable names like `a`, `b`, etc., and you will have to figure out what these mean.
Going from a disassembled code to something that you actually understand is a lot of work, and in most cases it is actually easier to just re-write the same thing from scratch.
Disassembling is still useful: for example, understanding how a computer virus works can help to defeat it. Or if you don’t have good documentation for something (let’s say a hardware device), decompiling existing drivers can help you write a new driver (e.g. to provide drivers for another operating system), etc.
But simply reusing the source code that comes from the decompiler is normally not useful.
They call it the *source* code because it’s the *source* of the compiled artifact you actually run. Compilation is an inherently lossy process, so you can’t really go back from a compiled artifact to its source. You can decompile it into *something*, but it’s not what the programmer actually wrote.
A computer runs entirely in binary: every program and application uses 1s and 0s to represent data. When you put source code into a compiler, it gets converted into machine language. Machine language doesn’t resemble any human language or mathematics. One of the most difficult ways to program is assembly language, which is one step above machine language; above assembly are compiled languages like C, and then interpreted languages like Python. Encryption is sometimes used to protect software as well, and without the key it is very hard to read the protected code. Encryption is like scrambling the 1s and 0s into an order that won’t run unless you have the key that tells you how to unscramble them. Let’s say I have the word REDDIT. I can scramble it to ERIDTD, which doesn’t make sense unless we have the key that tells us the order of the letters: the key could tell us that 1 is R, 2 is E, etc., or convey in some other way how to unscramble the letters.
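The REDDIT example can be written out as a small C++ sketch. This is only a toy letter shuffle, not real encryption, and the names `scramble` and `unscramble` as well as the key format are assumptions made for this illustration: the key lists, for each position of the scrambled word, which position of the original word the letter came from (counting from 0).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// key[i] is the position in the original text whose letter lands at position i.
std::string scramble(const std::string& text, const std::vector<std::size_t>& key) {
    assert(text.size() == key.size());
    std::string out(text.size(), ' ');
    for (std::size_t i = 0; i < key.size(); ++i) {
        out[i] = text[key[i]];
    }
    return out;
}

// Reverses scramble() by putting every letter back where the key says it came from.
std::string unscramble(const std::string& scrambled, const std::vector<std::size_t>& key) {
    assert(scrambled.size() == key.size());
    std::string out(scrambled.size(), ' ');
    for (std::size_t i = 0; i < key.size(); ++i) {
        out[key[i]] = scrambled[i];
    }
    return out;
}
```

With the key `{1, 0, 4, 2, 5, 3}`, `scramble("REDDIT", key)` gives `ERIDTD`, and `unscramble("ERIDTD", key)` gives `REDDIT` back. Without the key you would have to try orderings until one makes sense.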
This is about information.
A compiler takes a human readable programming language and converts it to machine code, which only computers can read.
This is like taking Moby Dick and compiling it into a CliffsNotes version.
You’ve got the same general content and story. But the CliffsNotes version is shorter and therefore has less information and substance.
You could take ChatGPT, give it the CliffsNotes, and ask it to write Moby Dick. The point is that there are many ways to write that novel. No two reconstructed novels or sets of source code are the same.
Writing good, clean code that is easy to maintain and modify is very difficult. Starting with decompiled code and trying to make anything useful is somewhere between impossible and awful. It’s like asking you to rewrite page 356 and change the tone from depressing to cheerful, using just the CliffsNotes, which notably don’t tell you anything about what’s on page 356.
The most valuable thing programmers produce is the knowledge of what works and what doesn’t. This is often in the source code as comments. Those are lost when the code is compiled. Also, the elegance of a well-written story is lost when you compress it down to just the plot points. Valuable information is lost.
So, there’s a few things here.
Compiling code turns it from human readable into garbled nonsense that only computers really know what to do with. Imagine you’re reading a book, but all the spaces and vowels have been taken out.
Next, companies often obfuscate their code. While decompiling code is possible, imagine that before compilation, they replace all the “ands” with just an a, all the buts with just a b, and so on.
And then there’s the fact that you might need special software to read specific file types at all.
It’s basically impossible to make code completely impossible to read and copy and understand. However, companies can make it very hard.
If you want to see projects that pick apart hard to decompile and understand code, I would recommend taking a dive into the modding communities of games that aren’t supposed to be modded.
You are mixing up some terms there. It’s source code vs. object code, and open source vs. closed source. Object code is created in various ways, such as by compilers, interpreters and everything in between: runtime libraries, just-in-time compilers, etc. You can get a licence to use the source code, the object code, or both, the object code being what you install, run and use. For proprietary software, the source is not usually licensed apart from certain formal business use cases. The licence terms dictate what you can and cannot legally do. If you have a licence to the source, you also get instructions on how to build your object code and what tools and dependencies are required to do so.
So that is the first hurdle: legality. The second is complexity. One high-level statement in your source code can translate to many thousands of low-level instructions and calls to a cascade of libraries, which can be third-party, all the way through to the OS and the BIOS, and then on to the microcode and instructions of the CPU/GPU of the computer it’s running on. If you have knowledge of machine/assembly language and in-depth knowledge of all the various layers, you can follow along, make modifications both benign and nefarious, or indeed, if you have the time, patience and budget for it, recreate a source of sorts.
Depending on how far companies want to go to protect their IP or for security, there are various schemes of devious complexity that will thwart your efforts. One example is polymorphic code in viruses, which can hide and rewrite itself in real time, but that’s for another day.
So if, for example, a company accidentally releases a game where a debugging menu is available, and I decompile the program and use it alongside the debugger, would I be able to check exactly where in the code I need to focus? But here is my question: after I know where to look, what’s next? The program is still decompiled; I just know where to look.
The biggest reason why you cannot get source code from a binary file is that the compiler will remove everything that is unnecessary for the computer to run the program. This includes everything that makes it easier for a programmer to understand what the code does.
Let’s take this one line of code: `totalPixels = resolutionX * resolutionY`
The names of the variables? They are unnecessary for the computer to run, so instead it would look like this: `$a = $b * $c`
So when you try to get the source code back, you no longer have the information about what the variables were called. The fact that you wanted the total number of pixels is lost.
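Here is a small C++ sketch of why that loss cannot be undone. The helper `stripNames` is hypothetical and much simpler than what a real compiler does: it just replaces every name in a space-separated line with `$a`, `$b`, `$c` in order of first appearance. The point is that two lines with completely different meanings to a human end up identical.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Replaces each identifier with $a, $b, ... in order of first appearance.
// Toy assumptions: tokens are separated by spaces, at most 26 distinct names.
std::string stripNames(const std::string& line) {
    std::map<std::string, std::string> renamed;
    std::istringstream in(line);
    std::string token;
    std::string out;
    while (in >> token) {
        const unsigned char first = static_cast<unsigned char>(token[0]);
        if (std::isalpha(first) || first == '_') {
            auto it = renamed.find(token);
            if (it == renamed.end()) {
                assert(renamed.size() < 26);
                const std::string placeholder =
                    std::string("$") + static_cast<char>('a' + renamed.size());
                it = renamed.emplace(token, placeholder).first;
            }
            token = it->second;  // the original name is gone from here on
        }
        if (!out.empty()) out += ' ';
        out += token;
    }
    return out;
}
```

Both `stripNames("totalPixels = resolutionX * resolutionY")` and `stripNames("area = width * height")` return `$a = $b * $c`. Given only that result, nobody can tell which of the two the programmer originally wrote.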
Source code is written by humans and is therefore human readable. Let me try to show you by example. You can write the following code in C++:
`// Your state sales tax`
`float SALES_TAX_PERCENT = 7.25;`
`// Calculate the sales tax owed on a price`
`double salesTaxOnPrice(double price) {`
`return price * (SALES_TAX_PERCENT / 100);`
`}`
I am sure even somebody who doesn’t know how to program can understand it. Once compiled, it becomes just instructions and numbers to the computer. When you disassemble it, you will see this:
`L001:`
`.long 1088946176`
`L002(double):`
`push rbp`
`mov rbp, rsp`
`movsd QWORD PTR [rbp-8], xmm0`
`movss xmm0, DWORD PTR L001[rip]`
`movss xmm1, DWORD PTR .LC0[rip]`
`divss xmm0, xmm1`
`cvtss2sd xmm0, xmm0`
`mulsd xmm0, QWORD PTR [rbp-8]`
`movq rax, xmm0`
`movq xmm0, rax`
`pop rbp`
`ret`
`.LC0:`
`.long 1120403456`
Not so easy to understand now, is it? However, humans can still figure it out. Now imagine you intentionally add code which does nothing and start jumping back and forth. A computer can still handle that with ease, but for a human it becomes very difficult to understand. There are even more advanced ways of making your code less human-readable.
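As one example of “figuring it out”: the two `.long` values in the listing are not random. Assuming the usual 32-bit IEEE 754 float format, they are just the raw bit patterns of the numbers from the source code. A minimal C++ sketch (the function names `bitsToFloat` and `recoveredTaxRate` are mine, not from any tool) to decode them:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Reinterprets a 32-bit pattern as a single-precision float.
float bitsToFloat(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    static_assert(std::numeric_limits<float>::is_iec559, "assumes IEEE 754 floats");
    float value;
    std::memcpy(&value, &bits, sizeof(value));  // copy the bits, do not convert the number
    return value;
}

// Mirrors the divss step in the listing: first constant divided by the second.
float recoveredTaxRate() {
    return bitsToFloat(1088946176u) / bitsToFloat(1120403456u);
}
```

`bitsToFloat(1088946176u)` is 7.25 and `bitsToFloat(1120403456u)` is 100, so the `divss` is the `SALES_TAX_PERCENT / 100` from the source. You can recover the numbers this way, but not the name `SALES_TAX_PERCENT` or the comment that said what it was for.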
You have a lot of explanations of how the compiled binaries are one way or another difficult for humans to read but I would argue that is of lesser concern. In the end, most code is going to be boilerplate and only a tiny part represents the algorithms you might want to steal or the DRM components if you want to copy and redistribute the program.
Indeed, we see this: piracy is essentially people reverse engineering the DRM components and copying the program for redistribution without effective DRM. The same is the case for high-value algorithms. They can be reverse engineered if they are valuable enough.
The primary reason in the end is perhaps disappointingly simple. It is often illegal. And it’s hard enough that the average user can’t do it by themselves. So either it doesn’t get done, or someone places themselves under significant risk by committing a crime and then distributing the goods.