I’ve been wondering: if we are using the software or even the hardware of a tech company, we can play around with it, install and uninstall things, and more. So why is it so difficult for someone to “unhide” the source code that the device uses? Technically the code is hidden somewhere in the device, so it’s there, yet it’s almost impossible to obtain the source code. How do they achieve this so that no one copies their code?
Let me try to be more “for a 5 y.o.” by explaining how a processor works and how we (historically) tried to make programming easier (which also made decompiling harder).
You have probably heard that computers “only use 0 and 1”, even though you clearly interact with e.g. your smartphone using text and pictures and other abstract concepts. So this is where we go back to the 0s and 1s… 🙂
A command that your computer’s CPU (or “processor”) can *actually* understand could be represented e.g. as `10110101`. This means that it should set the electricity to **on** for the first transistor, to **off** for the second, then the next two are **on** again, etc. This is what we call an “opcode” (short for “operation code”).
Of course, the processor is immensely complex, and the commands are also more complex nowadays, but that is basically how a processor command works.
To make their lives easier, programmers first started to use shorter hexadecimal numbers instead of the long binary numbers above (in this case: `B5`). Then they gave them more meaningful names (e.g. `ADD` or `JMP`) and called this “Assembler” (or assembly language), but the principle is still the same: you need a series of numbers that represent the transistor states that the processor should take.
The next step would be to find a way to write something more human-readable and convert (= “compile”) this into such opcodes. The early compiled languages did just that. They let you write something like `let xpos = ypos + 3` and converted this into a set of processor opcodes that could be transcribed as:
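To make the “same thing, three spellings” idea concrete, here is a tiny Python sketch. The binary-to-hex conversion is real arithmetic; the mnemonic table is made up for an imaginary processor (on a real CPU, `B5` would mean whatever that CPU’s manual says it means).

```python
# The same 8-bit pattern written three ways.
bits = "10110101"

value = int(bits, 2)              # read the string as a base-2 number
hex_form = format(value, "02X")   # two uppercase hexadecimal digits

# Hypothetical mnemonic table for an imaginary processor (an assumption,
# not taken from any real instruction set).
MNEMONICS = {0xB5: "LOAD"}

print(bits, "=", hex_form, "=", MNEMONICS[value])
```

Nothing is lost or added in any of these steps, which is why this direction is easy to reverse.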
> take the value from the memory cell whose address is stored in the variable `ypos`, add a fixed value of 3 to it and then write the result into the memory cell whose address is stored in variable `xpos`.
That is a lot of different opcodes to write, and things get more complicated once you use loops, if-then statements and subroutines.
And when you look at “higher” programming languages (think C++) all the abstraction and encapsulation makes things even more complicated.
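As a rough illustration of what such an early compiler does with `let xpos = ypos + 3`, here is a toy version in Python. Everything about the target machine is invented for the example: the opcode numbers, the one-byte operands and the memory addresses in `addresses` are all assumptions, not a real CPU.

```python
import re

# Hypothetical opcodes for an imaginary 8-bit machine (not a real CPU).
LOAD, ADD_IMM, STORE = 0xA1, 0x05, 0xA2

def compile_let(statement, addresses):
    """Translate 'let <dst> = <src> + <number>' into toy machine code."""
    match = re.fullmatch(r"let (\w+) = (\w+) \+ (\d+)", statement.strip())
    if match is None:
        raise ValueError("unsupported statement: " + statement)
    dst, src, number = match.group(1), match.group(2), int(match.group(3))
    return bytes([
        LOAD, addresses[src],    # read the memory cell assigned to src
        ADD_IMM, number,         # add the fixed value
        STORE, addresses[dst],   # write the result to the cell assigned to dst
    ])

# The compiler decides where each variable lives; these addresses are made up.
addresses = {"xpos": 0x10, "ypos": 0x11}
code = compile_let("let xpos = ypos + 3", addresses)
print(code.hex(" "))
```

Note that the words `xpos` and `ypos` do not appear anywhere in the resulting bytes. Only the addresses do.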
Now, going from the machine code (the zeros and ones) to the short mnemonic codes of Assembler is relatively easy. They are basically the same, really. The tool that does this is called a disassembler. But the further you go from there the more complicated it gets, and a tool that tries to get back to a high-level language is called a decompiler. There might be many reasons why it finds the opcode to load something from a memory cell, and there can be many reasons to add a fixed value. At some point, the decompiler will have to guess what is the most likely original source code.
But most importantly, the processor code does not contain any variable or function names, no comments and no other information that was present in the high-level source code. The decompiler will just invent variable names like `a`, `b`, etc. and you will have to figure out what these mean.
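The “easy” direction, from machine code back to mnemonics, is basically a table lookup. A minimal sketch, assuming an imaginary machine where every instruction is one opcode byte followed by one operand byte (the opcode numbers and names are invented for the example):

```python
# Hypothetical opcode table for an imaginary 8-bit machine (not a real CPU).
MNEMONICS = {0xA1: "LOAD", 0x05: "ADDI", 0xA2: "STORE"}

def disassemble(code):
    """Turn pairs of (opcode, operand) bytes into readable mnemonic lines."""
    lines = []
    for i in range(0, len(code), 2):
        opcode, operand = code[i], code[i + 1]
        lines.append(f"{MNEMONICS[opcode]} 0x{operand:02X}")
    return lines

machine_code = bytes.fromhex("a1 11 05 03 a2 10")
listing = disassemble(machine_code)
print("\n".join(listing))
```

The listing tells you *what* the processor does (load, add, store), but nothing about *why* the programmer wanted it done.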
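Here is a deliberately naive sketch of that last step. It recognises exactly one pattern (load, add a constant, store) on an imaginary machine with made-up opcode numbers, and since the original names are gone it simply hands out `a`, `b`, ... in the order it meets each memory address. Real decompilers are far more sophisticated, but they face the same problem.

```python
# Hypothetical opcodes for an imaginary 8-bit machine (not a real CPU).
LOAD, ADD_IMM, STORE = 0xA1, 0x05, 0xA2

def decompile(code):
    """Guess a source line for the pattern LOAD src, ADD_IMM n, STORE dst."""
    if len(code) != 6 or (code[0], code[2], code[4]) != (LOAD, ADD_IMM, STORE):
        raise ValueError("pattern not recognised")
    src_address, number, dst_address = code[1], code[3], code[5]

    names = {}  # memory address -> invented variable name

    def name_for(address):
        if address not in names:
            names[address] = chr(ord("a") + len(names))
        return names[address]

    dst_name = name_for(dst_address)
    src_name = name_for(src_address)
    return f"{dst_name} = {src_name} + {number}"

guess = decompile(bytes.fromhex("a1 11 05 03 a2 10"))
print(guess)
```

Whether the original line was `let xpos = ypos + 3` or `let age = birthYear + 3` cannot be told from the bytes; both could compile to exactly this sequence.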
Going from disassembled code to something that you actually understand is a lot of work, and in most cases it is actually easier to just re-write the same thing from scratch.
Disassembling is still useful: for example, understanding how a computer virus works can help to defeat it. Or if you don’t have good documentation for something (let’s say a hardware device), decompiling existing drivers can help you write a new driver (e.g. to provide drivers for another operating system), etc.
But simply reusing the source code that comes from the decompiler is normally not useful.