I’ve heard many times that the reason the Silent Hill remaster collection didn’t turn out so well was because Konami lost the original source code and had to re-create it. But I don’t understand how that is possible. If they were selling copies of Silent Hill, why couldn’t they just take a single disk of it and datamine the source code off of it? How could they possess the game without possessing the game’s source code?
They didn’t lose all of the source code. They lost the entire code for the gold (release) version of the game, and some other parts completely, but they still had some late-stage builds available. There are some fun dev stories from the people who made the remaster about looking through old notes from the OG devs to figure out how they had fixed bugs that now had to be fixed again.
When code is compiled, it’s converted from the easily readable human format into an efficient computer format. All of the context is stripped out.
For humans, we write code like “if life equals 20, show low health warning”. This makes it easy for us to follow and see what’s going on. In actual computer language this might be more like “grab memory location 1110100010111, does it equal 100101? If no, go to line 101011101.”
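To make that concrete, here is a rough, made-up sketch (not actual Silent Hill code – the names and numbers are invented) of the two forms:

#include <cstdio>

struct Player { int life; };                 // invented type, purely for illustration

void checkHealth(const Player& p) {
    if (p.life == 20) {                      // “if life equals 20...”
        std::puts("LOW HEALTH WARNING");     // “...show low health warning”
    }
}

// After compilation, roughly all that survives is: load a value from some
// memory offset, compare it with 20, then jump or call some address.
// The names Player, life and checkHealth are gone entirely.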
So while it’s possible to “decompile” the software on the retail copies, that decompiled copy would be extremely hard to read and follow, and they would basically need to go through each step and re-write it into human-readable code, which would be extremely tedious and take more work than just starting over.
The ELI5 answer to “why couldn’t they get the source code from the retail copies” is “you can’t turn hamburger back into a cow”.
To be runnable, most code gets turned into a different form that the machine can run efficiently. This process (compilation) removes most of the information humans need to edit the code.
(Some languages tend to leave more information than others, for example with Java and .NET you even get most variable names and pretty readable code back, but with C++, you’d have a better chance with the cow than the code.)
The problem word in that sentence is “just”. No, they can’t “just” datamine the source code. Something vaguely equivalent can be done, but it’s hard, skilled, labour-intensive work – which means expensive. Konami probably felt it was cheaper and more feasible to redo it from scratch than to attempt to recover the original.
The original source code would have had meaningful names for things like variables and subroutines. It would have had comments explaining in plain language what the code at any particular point is doing, things like what values the input can take, and why things aren’t done a different way. And it would have been structured in a way that made it easy for humans to read and understand. The compiled code has none of that. And it isn’t necessarily even a close match. Not only would you lose all the names and comments, but the compiler would quite likely have shuffled things around to optimise the resulting code for execution. It might even have changed precisely how things are done in places into something functionally equivalent.
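As a made-up illustration of that last point, an optimising compiler will happily rewrite code into something that gives the same answer by a different route:

// What the programmer wrote (names and numbers invented for illustration):
int totalDamage(int hits) {
    int total = 0;
    for (int i = 0; i < hits; ++i) {
        total += 25;                 // 25 damage per hit
    }
    return total;
}
// A typical optimiser collapses the loop into, in effect, “return hits * 25;”
// (plus a guard for zero or negative input). Same result, very different code;
// the decompiled version can only show you what the compiler emitted.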
So, yes, you can pass the code through a program to partially reverse the process (a decompiler) and get a written version of the actual code. But what comes out won’t look anything like what originally went in. If you want to change ANYTHING, you’re going to have to go through it line by line, work out what areas in storage represent what, what each single instruction actually represents, restructure the result to be usable, and so on. The people who wrote the code will have a better chance than people who are unfamiliar with it, but if they’re actually still available to do it, rather than having moved on to other things or even out of the company, you’re lucky. And there’s a LOT of code. So basically what you have is the mother of all crossword puzzles. It’s going to take a lot of effort and tie up skilled people. As a business you have to decide whether it’s worth it, or whether there’s perhaps another approach.
I’ve attempted to write a few decompilers. It’s not easy. Code and data are often interspersed throughout the binaries, and a disassembler doesn’t ‘know’ which is which – usually things like jump tables or string offsets. For example:
jmp table[rax]                          ; indirect jump through a table of code addresses
table: func1, func2, func3, func4       ; data (four code pointers) sitting in the middle of the code
func1: inc rdx                          ; ordinary instructions resume here
….
Running a disassembler on that, it would try to generate code from the data at the table address. Some disassemblers (IDA Pro etc.) are now smart enough to figure this out. There are also certain common signatures (function entry/exit) and common library functions (libc/libm, Windows GDI, etc.) that can mostly be detected or ignored.
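For context, a completely ordinary switch statement is one common way those jump tables end up in a binary (a made-up example – the function names mean nothing in particular):

int onIdle();
int onAttack();
int onDamage();
int onDeath();

int handleEvent(int eventId) {
    switch (eventId) {
        case 0: return onIdle();
        case 1: return onAttack();
        case 2: return onDamage();
        case 3: return onDeath();
        default: return 0;
    }
}
// For a dense run of cases like this, compilers often emit an indirect jump
// through a table of four code addresses – exactly the kind of data block a
// naive disassembler will try to decode as instructions.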
One problem is that modern compilers are VERY good at optimizing, and with things like constexpr functions and lambda functions in C++, the compiler will do work from the source code that never ends up in the binary. And if debugging information is stripped out of the result (which it would be for a commercially released game) you have no idea what each memory location means.
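A small, made-up illustration of that compile-time work:

// Invented example: everything here is evaluated while compiling.
constexpr int tableSize(int levels) { return levels * 64; }

constexpr int kLootTableSize = tableSize(12);   // computed by the compiler: 768

int lootTable[kLootTableSize];                  // only the number 768 survives
// The function tableSize never exists in the shipped binary at all; a decompiler
// just sees an array of 768 ints with no hint of where that size came from.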
You may be able to get a flowchart of code execution, or, by running it through an emulator or code coverage tool, see which bytes are actually executed as code (vs. data).
You still won’t always know what every result means. And at best you could auto-generate code that has made-up variable names: instead of code like
if (player->hit()) {
    player->hp -= 10;
}
you’d get code like
if (func_2394(mem_3942)) {
    mem_3948 -= 10;
}
“Source code” means many things. It includes text files in one or more computer languages, libraries of precompiled functionality (these often come with a Software Development Kit for a particular language), and various support files. For example, a makefile is the set of instructions for how to put the various pieces together to yield the finished product: compiling the code files, then linking the resulting object files with library code in a specific way.
The makefile is essential to the process, and there’s no trace of it in the finished software. It’s also completely dependent on the linker being used.
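A stripped-down, entirely invented example of what such a makefile might look like (the file names, flags and library are made up for illustration):

# Hypothetical makefile, for illustration only.
game: main.o render.o audio.o
	cc main.o render.o audio.o -lSDL -o game    # link the object files into the executable

main.o: main.c
	cc -O2 -c main.c                            # compile one source file to an object file

None of it – the file layout, the compiler flags, the link order – leaves any trace in the shipped executable.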
The object code can indeed be de-compiled (there are commercial decompilers for popular languages) but the result is difficult to understand. For example, computer programs make widespread use of *variables,* pieces of data that contain a value. Variables are usually given meaningful names to make the code readable, for example dCurrentDate. The variable names are stripped away when compiled to object code, replaced by numeric references. So de-compiled code is missing all of the variable name cues that help to understand what is going on.
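To make that concrete, here is an invented before-and-after sketch (not the output of any particular decompiler):

// What the programmer originally wrote:
bool isExpired(double dCurrentDate, double dExpiryDate) {
    return dCurrentDate > dExpiryDate;    // the names explain the intent
}

// Roughly what a decompiler hands back:
// bool FUN_00401a20(double param_1, double param_2) {
//     return param_1 > param_2;          // which one was the expiry date?
// }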
It is difficult, and in many cases impossible, to reverse-engineer source code from compiled code. Variable names get jumbled up to the point where you don’t know what the original code is doing. This addresses your second question; many others have commented on it correctly.
For your first question, what was also missing back then is the concept of automatically backing up the source code in another system. Version control systems have been around for many years (logging changes from one version to another, as with SVN and now, more commonly, git), but automatically storing those changes in another system is more recent. These are often in “the cloud” – computers that are usually nowhere near you and backed up constantly. The individual codebases are referred to as “repositories” or “repos.” GitHub, Bitbucket, and GitLab store these repos online, making losing source code extremely unlikely. Those kinds of automated systems weren’t readily available back then, or weren’t used as widely in practice.
Many companies relied on an overworked but good-natured person manually backing up systems to tape drives onsite. Even scheduling those tasks could involve human error. Cloud-based systems brought an immense amount of value to the behind-the-scenes operations of many companies.