this post was submitted on 03 Jan 2024
104 points (95.6% liked)

No Stupid Questions

36176 readers
665 users here now

No such thing. Ask away!

!nostupidquestions is a community dedicated to being helpful and answering each others' questions on various topics.

The rules for posting and commenting, besides the rules defined here for lemmy.world, are as follows:

Rules (interactive)


Rule 1- All posts must be legitimate questions. All post titles must include a question.

All posts must be legitimate questions, and all post titles must include a question. Questions that are joke or trolling questions, memes, song lyrics as title, etc. are not allowed here. See Rule 6 for all exceptions.



Rule 2- Your question subject cannot be illegal or NSFW material.

Your question subject cannot be illegal or NSFW material. You will be warned first, banned second.



Rule 3- Do not seek mental, medical and professional help here.

Do not seek mental, medical and professional help here. Breaking this rule will not get you or your post removed, but it will put you at risk, and possibly in danger.



Rule 4- No self promotion or upvote-farming of any kind.

That's it.



Rule 5- No baiting or sealioning or promoting an agenda.

Questions which, instead of being of an innocuous nature, are specifically intended (based on reports and in the opinion of our crack moderation team) to bait users into ideological wars on charged political topics will be removed and the authors warned - or banned - depending on severity.



Rule 6- Regarding META posts and joke questions.

Provided it is about the community itself, you may post non-question posts using the [META] tag on your post title.

On fridays, you are allowed to post meme and troll questions, on the condition that it's in text format only, and conforms with our other rules. These posts MUST include the [NSQ Friday] tag in their title.

If you post a serious question on friday and are looking only for legitimate answers, then please include the [Serious] tag on your post. Irrelevant replies will then be removed by moderators.



Rule 7- You can't intentionally annoy, mock, or harass other members.

If you intentionally annoy, mock, harass, or discriminate against any individual member, you will be removed.

Likewise, if you are a member, sympathiser or a resemblant of a movement that is known to largely hate, mock, discriminate against, and/or want to take lives of a group of people, and you were provably vocal about your hate, then you will be banned on sight.



Rule 8- All comments should try to stay relevant to their parent content.



Rule 9- Reposts from other platforms are not allowed.

Let everyone have their own content.



Rule 10- Majority of bots aren't allowed to participate here.



Credits

Our breathtaking icon was bestowed upon us by @Cevilia!

The greatest banner of all time: by @TheOneWithTheHair!

founded 2 years ago
MODERATORS
 

I see a lot about source codes being leaked and I'm wondering how it that you could make something like an exact replica of Super Mario Bros without the source code or how you can't take the finished product and run it back through the compilation software?

you are viewing a single comment's thread
view the rest of the comments
[–] fenynro 72 points 11 months ago* (last edited 11 months ago) (1 children)

The long answer involves a lot of technical jargon, but the short answer is that the compilation process turns high level source code into something that the machine can read, and that process usually drops a lot of unneeded data and does some low-level optimization to make things more efficient during actual processing.

One can use a decompiler to take that machine code and attempt to turn it back into something human readable, but will usually be missing data on variable names, function calls, comments, etc. and include compiler-added optimizations which makes it nearly impossible to reconstruct the original code

It's sort of the code equivalent of putting a sentence into Google translate and then immediately translating it back to the original. You often end up with differences in word choice that give you a good general idea of intent, but it's impossible to know exactly which words were in the original sentence.

[–] Squizzy 9 points 11 months ago (4 children)

Thank you, sorry to push further but my understanding is that computers deal with binary so every language is compiled to machine code, which I took as binary.

So if the language has elements being removed and the machine doesn't need them shouldn't you get back out exactly what is needed to do the task? Like if you compiled some code and then uncompiled it you would get the most efficient version of it because the computer took what it needed, discarded the rest and gave it back to you?

[–] fenynro 17 points 11 months ago* (last edited 11 months ago) (1 children)

It depends on the specifics of how the language is compiled. I'll use C# as an example since that's what I'm currently working with, but the process is different between all of them.

C#, when compiled, actually gets compressed down to what is known as an intermediate language (MSIL for C# specifically). This intermediate file is basically a set of genericized instructions that are not linked to any specific CPU. This is useful because different CPUs require different instructions.

Then, when the program is run, a second compiler known as the JIT (just-in-time) compiler takes the intermediate commands and translates them into something directly relevant to the CPU being used.

When we decompile a C# dll, we're really converting from the intermediate language (generic CPU-agnostic instructions) and translating it back into source code.

To your second point, you are correct that the decompiled version will be more efficient from a processing perspective, but that efficiency comes at the direct cost of being able to easily understand what is happening at a human level. :)

[–] Squizzy 1 points 11 months ago (2 children)

Could I trouble you to go deeper? I'm think I'm getting it but if we were to say uncompile GTA V or Super Mario Bros, could we make changes and figure it out from there or would it be complete nonsense with no way points to jump in at and get a grip on what is being done.

On a side note I was told once that everything is 1s and 0s and as a result that someone could type a picture of you if they got the order right. This could be why I'm so wrong in my understanding given I'm now assuming this was bullshit.

[–] mindlessLump 8 points 11 months ago

Here is a real world example of someone doing some reverse engineering of compiled code. Might help you understand what is possible, and some of the processes. https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/

[–] [email protected] 3 points 11 months ago

At a very low level, yes, everything is 1s and 0s. However, virtually nobody deals with binary anymore. Programming languages are abstractions over abstractions over abstractions not to have to deal with typing binary.

The point of programming languages is for humans to be able to read it and make sense out of it. It’s a way to represent in a kind of intermediate language that’s halfway between something humans can read and computers can interpret.

Say the game’s programmer wants to handle moving your character right on pressing the right arrow key. They might write some function called “handleRightArrow()”, which does whatever. Then your compiler will turn this to some instructions - read stuff in RAM at address XYZ, copy it over, etc. The original code with readable names, comments, documentation, proper organization, it’s gone. Once you decompile, it’s gonna be random function/variable names, compiler might have rewritten some parts of the implementation as automatic optimizations, unlined some functions, etc. The human readable meaning of the code is lost. It does the same thing as the original code, but it isn’t the original code either.

[–] [email protected] 10 points 11 months ago* (last edited 11 months ago) (1 children)

The main issue is that to make code human-readable, we include a lot of conventions that computers don't need. We use specific formatting, name conventions, code structure, comments, etc. to help someone look at the code and understand its function.

Let's say I write code, and I have a function named 'findUserName' that takes a variable 'text' and checks it against a global variable 'userName', to see if the user name is contained in the text, and returns 'true' if so. If I compile and decompile that, the result will be (for example) a function named 'function_002' that takes a variable 'var_local_000' and checks it against 'var_global_115'. Also, my comments will be gone, and finding where the function was called from will be difficult. Yes, you could look at that code and figure out that it's comparing the contents of two variables, but you wouldn't know that var_global_115 is a username, so you'd have to go find where that variable was set and try to puzzle out where it was coming from, and follow that rabbit hole backwards until you eventually find a request for user input which you'd have to use context clues to determine the purpose of. You also wouldn't have the context around what 'var_local_000' represented unless you found where the function was called, and followed a similar line backwards to find the origin of that variable.

It's not that the code you get back from a decompiler is incorrect or inefficient, it's that it's very much not human-readable without a lot of extra investigatory work.

[–] [email protected] 1 points 11 months ago

This might change now relatively fast, now that large language models can process code, you could give the function to LLM to rename the function. Iterating over the code and rename all functions and variables.

This won't of course reproduce exact code, but it makes one really heavy part of reconstruction to human readable much lighter.

[–] [email protected] 5 points 11 months ago* (last edited 11 months ago) (1 children)

The implicit assumption with decompiling code is that the goal is either to inspect how the code works, or to try compiling for a different machine. I'll try to explain why the latter is quite difficult.

As you said, compilation to machine code only keeps the details needed for the CPU to accomplish what was instructed. And indeed, that is supposed to be efficient to run on that CPU, by reason of being targeted exactly for that CPU. But when decompiling, the resulting code will reflect the specificity to that same CPU. If you then try to compile that code for a different CPU, it will likely work, but will likely be inefficient because the second CPU's unique advantages won't be leveraged.

To use an example, consider how someone might divide two large numbers. Person A learned long division in school, and so takes each number and breaks it down into a series of smaller multiplications and subtractions. Person B learned to do division using a calculator, which just involves entering the two numbers and requesting that they be divided.

Trying to do division by blindly giving Person B that series of multiplications and subtractions to do on the calculator is extremely inefficient because Person B knows how to do division easily. But Person B is following Person A's methods, without knowing that the whole point of this exercise is to just divide the two original numbers. Compilation loses context and intent, which cannot be recovered from decompilation, for non-trivial programs.

Here is an example why source code is useful when it provides context: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code . Very few people would be able to figure out how this works from just the machine code.

[–] [email protected] 2 points 11 months ago* (last edited 11 months ago) (2 children)

follow up, would it be easier to read this context-less source code or stay at assembly? If for example you'd like to modify a closed source app

[–] [email protected] 2 points 11 months ago* (last edited 11 months ago)

Like many things, it's very fact-intensive, varying in different circumstances. As others have noted, the abilities of the person undertaking the decompilation will influence the decision. But so will strategy: the overall goal can drive how decompilation is approached.

For example, suppose you're working for an airline company and need to rewrite some software used on an ancient IBM System/360 machine and was written in the COBOL language, for which no source code is available and you cannot find many people who even know COBOL. Here, since the task is to rewrite the code, decompilation is just to tell you how it works and then you'll want to write the new program in a modern language. It may be useful to decompile to a different language if such a decompiler is available, say to the C language, which you better understand.

Sure, it may be that C isn't what the new program will be written in, but if your C reading skills are sufficient, then this is a valid strategy.

The skill of a decompiling engineer -- or any engineer really -- is leveraging your skills and your tools to tractably attack the difficult problem at hand. Many equally-skilled engineers can plausibly approach the same problem differently.

[–] fenynro 2 points 11 months ago (1 children)

Probably depends on how comfortable you are at reading assembly instructions for your specific CPU, but I think generally the contextless source code is probably preferable. Either way you've got a headache of an investigation in front of you though.

here's an example of what it might look like with either option

[–] [email protected] 2 points 11 months ago

oh wow, I now respect pirates even more. No wonder there are only like 3 guys that can and will do this.

If you decompile you need such an understanding of the language. I could see someone looking at this and going "oh yeah that compares cases", but then die of old age before finishing the sentance.

And if you don't decompile you are coding assembly.

[–] [email protected] 2 points 11 months ago

if you compiled some code and then uncompiled it you would get the most efficient version of it ... ?

Sorta, an optimizing compiler will always trim dead code which isn't needed, but it will also do things that are more efficient but make the code harder to understand like unrolling loops. e.g. you might have some code that says "for numbers 1-100 call some function" the compiler can look at this and say "let's just go ahead and insert 100 calls to that function with the specific number" so instead of a small loop you'll see a big block of function calls almost the same.

Other optimizations will similarly obfuscate the original programmers intent, and thinks like assertions are meant to be optimized out in production code so those won't appear in the de-compiled version of the sources.