Why is it so difficult to copy source code that is not “open source”?

It’s been on my mind: if we use a tech company’s software, or even its hardware, we can play around with it, install and uninstall it, and more. So why is it so difficult for someone to “unhide” the source code that the device uses? Technically the code is hidden somewhere in the device, so it’s there, yet it’s almost impossible to obtain. How do they achieve this so that no one copies their code?

42 Answers

Anonymous 0 Comments

A big factor here is that a lot of information is lost when source code is compiled and linked into program files. Variable names and function names are replaced by registers, stack offsets, and addresses. Compilers also optimize code, and distinctions like a while loop versus a for loop are lost at this level. There are also lots of nuances in telling instructions apart from data. Some architectures are easier to reverse than others, depending on the complexity of the instruction set and whether instructions have fixed widths or alignments. Some processors (like ARM, which can switch between its ARM and Thumb instruction sets) can also flip between modes within a program, meaning you need additional context to understand the instructions. Plus, on top of all of this, some companies intentionally do things in their source or compilation to make it harder for someone to reverse the program back to source.
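To make the point about lost names concrete, here is a minimal sketch using CPython bytecode as a stand-in for machine code. That is an assumption for illustration only: real native code goes further, and CPython actually keeps names in a separate table, a bit like debug symbols. The function names `add_two_numbers`, `f`, `multiply` and the helper `same_instructions` are made up for this example.

```python
def add_two_numbers(first, second):
    # Add the inputs and hand back the sum.
    total = first + second
    return total


def f(a, b):
    c = a + b
    return c


def multiply(a, b):
    c = a * b
    return c


def same_instructions(func_a, func_b) -> bool:
    """True when both functions compile to byte-for-byte identical instructions."""
    return func_a.__code__.co_code == func_b.__code__.co_code


# Names and comments differ, but the instruction stream only refers to
# numbered local slots, so the two "add" functions look the same.
print(same_instructions(add_two_numbers, f))
# A function that does something different still compiles differently.
print(same_instructions(add_two_numbers, multiply))
```

If you only had the raw instruction bytes, nothing in them would tell you whether the author wrote `add_two_numbers` or `f`.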

Source: I teach about reversing software at Black Hat

Anonymous 0 Comments

Unlike what most answers here say, you actually can reverse closed-source code, though it often takes real work and you won’t get back the exact original.

There are broadly two kinds of binaries (it’s more complex than this, but this is a brief description):

1. Compiled, native code. Software source code is translated directly into the language that the machine speaks. All human-readable information is gone. You can technically still read the machine code, but it’s missing the context required to know what things are and what they are doing. There are static analysis tools (disassemblers and decompilers) that will recover some of the structure of the original source code, but you lose all of the names and documentation and get a bunch of labels like “FUNC_0001” (instead of, say, “AddTwoNumbers”). It then becomes an incredibly challenging and time-consuming puzzle to reconstruct. This process is called reverse-engineering. You need to use context clues and an understanding of the compiler and machine language to make educated guesses until you have recreated mostly equivalent source code.

2. Packaged, interpreted code (I’m including VMs like .NET and Java in this description, even though it’s not technically correct to call them interpreted). Sometimes source code doesn’t get translated into machine code. It gets translated into some intermediary format and bundled with resources like images and sound into a package. A separate program, itself compiled using option #1 for that specific computer, reads the code in this package and executes it. These packages often still have a lot of context information that makes it easier to reverse them, though not always. Some programs written in C# can be fed into a program that will spit out almost the exact same source code that went into them, or at least large parts of it. There are ways to obfuscate them, or convert them into something more similar to option #1, to prevent this. Often these formats include a lot of additional information about the source code for metrics and ease of use unless you explicitly tell them not to.
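As a rough illustration of option #2, Python’s own compiled format behaves this way. The sketch below compiles a small snippet, serializes the bytecode with the standard `marshal` module (roughly what ends up inside a `.pyc` file), and then reads names back out of the blob without ever seeing the source. The snippet, the `"<hidden>"` filename, and the helpers `package` and `recover_names` are hypothetical, chosen just for this example.

```python
import marshal
import types

# Hypothetical "closed" source that the end user never receives.
SOURCE = '''
# secret comment: tuned for the 2019 hardware
def add_two_numbers(first, second):
    total = first + second
    return total
'''


def package(source: str) -> bytes:
    """Compile source to bytecode and serialize it, roughly like a .pyc body."""
    code = compile(source, "<hidden>", "exec")
    return marshal.dumps(code)


def recover_names(blob: bytes) -> dict:
    """Load the bytecode back and list each function with its local names."""
    module_code = marshal.loads(blob)
    # Function bodies are stored as code objects among the module's constants.
    funcs = [c for c in module_code.co_consts if isinstance(c, types.CodeType)]
    return {func.co_name: list(func.co_varnames) for func in funcs}


blob = package(SOURCE)
recovered = recover_names(blob)
print(recovered)                  # function and variable names survive
print(b"secret comment" in blob)  # the comment was never compiled in
```

The function name and its local variable names come back intact, which is why decompilers for formats like this can get so close to the original. The comment is not in the package at all, so no tool can recover it.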

Essentially, when you go from source code to binary, you are stripping the program of most of the information that we as humans put in there to make it intelligible. You can never get all of that information back — some of it is no longer in the program at all (comments, variable names, the file and folder structure of the project) and some of it cannot be perfectly reconstructed (the exact groupings of functions, the line-by-line implementation of some functionality).