Reverse Engineering


CSE 365 - Fall 2024

Welcome to your introduction to Reverse Engineering! Reverse Engineering is a critial art that you will evolve during your journey through pwn.college, and this journey starts here with the Connor Image Format, cIMG. You have never heard of cIMG before, and it does not actually exist in the wider world, but throughout this module you will reverse engineer it by analyzing various iterations of cIMG rendering binaries. By the time you solve the last level, you will have gone from no knowledge about cIMG to complete understanding of it, a path that you will walk many times with many different programs.


Lectures and Reading

There are many resources related to reverse engineering around the internet.

  • A good place to start is a series of walkthroughs of several hacking challenges by ASU's own Adam Doupe on his YouTube channel.
  • A comprehensive revese engineering tutorial series.

As mentioned in the slides, there are a number of useful tools for this assignment! Here is a (non-exhaustive) list:

  • gdb will let you run and inspect the state of these programs. Please check out the Debugging Refresher module. We have also provided a quick briefer here. Some useful gdb concepts:
    • Know the difference between step instruction (si) and next instruction (ni). It boils down to the fact that si will follow jumps, and ni will step over jumps. This means that if you use si, you will quickly find yourself crawling through libc code, which is insane and unnecessary.
    • You can use x/i $rip to disassemble the next instruction that will be executed. You can call display/i $rip to make the next instruction display every time gdb prompts you for input. You can also do x/2i and display/2i to print two (or other quantities of) instructions.
    • The disas command will disassemble the current function that you are looking at.
    • gdb can be scripted! Look up conditional breakpoints and scriptable breakpoints in the gdb manual.
    • Modern binaries are position independent, meaning that they can be loaded anywhere in memory when they run. GDB will load them at the offset 0x555555554000. This means that if objdump is telling you that main starts at some address like, 0x100, the address when debugging with GDB will be 0x555555554100
  • strings will list printable strings in the file. This is useful for looking for constant strings that the program checks for (such as file names and so on) in the course of getting input. Keep in mind that the options for string include a minimum size that it will print.
  • Don't forget about pwntools! You will need to interact heavily with these programs. Do it right (with pwntools).
  • rappel is a nice tool to help you figure out what certain instructions do.
  • Tools for reverse engineering actual binaries, included inthe dojo:
    • IDA is the industry standard of reverse-engineering tools.
    • Ghidra is an open source direct competitor to IDA that is used and loved by many.
    • Binary Ninja is an alternative product that competes with IDA in the space.
    • angr-management is an open source up-and-coming reversing tool with some advanced functionality.
    • In a pinch, objdump -d -M intel the_binary will disassemble the binary you want to look at. -M intel, in that command, makes objdump give you nice and readable Intel assembly syntax.

Challenges

The first few challenges in this dojo will walk you through the creation of your first basic cIMG image. Of course, the point of this is not to create images, but rather to experience the process of understanding a program, and by extension understanding logic and data formats used by the program, just from reversing its binary.

The /challenge/cimg binary in this level is (the start of) an image rendering program, specifically focusing on the cIMG format. Software identifies the formats of files (e.g., whether the file is a GIF, a JPEG, an MP3, or so on) in a few ways:

  1. The file extension. This is the part of the file after the .: a ZardusSmiling.jpg is probably a JPEG of Zardus smiling, whereas KanakLaughing.mp3 is probably an MP3 of Kanak laughing.
  2. The magic number. Files can get renamed, or the filenames associated with files can be lost (e.g., in a partial filesystem failure) or simply missing (e.g., in a data stream). Thus, most file formats include a magic number in the format that a parser can check to identify it.

You've already interacted with plenty of files containing magic numbers. For example, the ELF binary files you've worked with all start with the bytes \x7fELF. Of course, to you, this looks like a semantic-bearing string of characters, but a computer reads it as a number, hence the term.

In this challenge, you must craft a file with a cimg extension that contains the correct magic number. You can learn this magic number by reversing the /challenge/cimg binary. If you properly get past the magic number check, the challenge will give you the flag!


Approach Suggestions: Some hopefully-useful suggestions to get you started:

  • Reverse engineering can be done "statically" (e.g., in a graphical reversing tool such as IDA and the like, with the program you are trying to understand remaining "at rest") or "dynamically" (e.g., in a debugger such as gdb, with the program you are trying to understand running). We recommend a combination of these techniques throughout this module. Use your graphical reversing tool to form hypotheses about the program (e.g., "it compares some bytes of my input against something at this assembly instruction address") and then verify these hypotheses in gdb (e.g., break at the address in question, look at the values of the registers it compares, and correlate them with your input).
  • Leave objdump behind. You might have used objdump previously to look at assembly code. You might be able to solve this level (and maybe the next) with objdump, but you cannot do this module without a good graphical reversing tool. Use this challenge as impetus to begin gaining familiarity with a graphical reversing tool.
  • Retrace your successful solution. If you solve this challenge without using both a graphical reversing tool and a debugger (gdb), go back and re-verify your solution using the tools that you did not use. This will be good practice for understanding how to use these tools in later levels, and you should apply it in any challenge that you solve without relying on both tools.

The most expedient way for computers to verify a magic number is by treating it as, well, a number. That's what this challenge's /challenge/cimg does, to show you how these things are typically done in practice. Here, we have a different magic number (otherwise there'd be no need to reverse the binary!) from the previous challenge. Reverse the binary, keep endianness in mind, and pass the magic number check for the flag.

Programs that parse evolving file formats must be able to tell what version of the format it must parse. This is, often, stored right near the magic number. Figure out how to provide the right cIMG version to this /challenge/cimg!


Writing binary data: You will find that you need to create files with characters that you cannot type on a keyboard for this level. You can do this in a number of ways, but one of these ways is creating a Python script to craft this file for you:

First, open the file that you want to write.

with open("my-file", "wb") as out_file:

The "wb" above tells Python to open the "my-file" file for writing raw bytes. Once we have this, we can write to it. Of course, you are familiar with the typical writing of files:

with open("my-file", "wb") as out_file:
    out_file.write(b"HELLO WORLD")

As you can see, characters that you can type normally can just be put in the Python bytestring to send to the file. What about others? As you may have previously seen in Python bytestrings, you can specify characters by their raw byte value using the \x "escape sequence". For example, \x41 creates a byte with a hexidecimal value of 0x41, which is an ASCII A. First, let's use this to write the above in a different way:

with open("my-file", "wb") as out_file:
  out_file.write(b"HELLO \x57\x4f\x52\x4c\x44")

This also writes HELLO WORLD into the file! The escape sequences are parsed by Python when it creates the bytestring as it executes your code

You can use other values to craft otherwise-untypable characters. For example, here we insert a null byte (value 0) after HELLO WORLD:

with open("my-file", "wb") as out_file:
  out_file.write(b"HELLO \x57\x4f\x52\x4c\x44\x00")

Null bytes are used in a lot of binary formats and protocols, as are plenty of other bytes with hard-to-type values. For some common ones, there are other "escape sequences" you can use as shorthand. Here are a few examples:

assert b"\0" == b"\x00" # our null byte
assert b"\n" == b"\x0a" # a newline

You also need to "escape" characters that would, for example, interfere with Python's syntax itself. For example, this prints HELLO "WORLD"!:

with open("my-file", "wb") as out_file:
  out_file.write(b"HELLO \"WORLD\"!")

The double quotes above must be escaped so that Python doesn't interpret them as the end of the bytestring!


Writing integer values: Of course, some bytes in a file format represent integers, typically stored in little-endian format. Writing these will require you to "pack" a typical integer (e.g., 5) into its binary representation (this depends on the size of the variable. For example, a 32-bit/4-byte 1 would be b"\x05\x00\x00\x00).

To convert between integers and raw bytes, check out the struct Python package.

with open("my-file", "wb") as out_file:
  # this packs the integer 1337 (0x539 in hex) into four little-endian bytes
  out_file.write(struct.pack("<I", 1337))

  # the above is equivalent to
  out_file.write(b"\x39\x05\x00\x00")

  # this packs the integer 1337 (0x539 in hex) two little-endian bytes
  out_file.write(struct.pack("<H", 1337))

  # the above is equivalent to
  out_file.write(b"\x39\x05")

If you are curious, you can look at other format specifiers in struct's documentation.

Let's continue building out the cIMG format. What does an image need? Dimensions! Specify the correct ones here, and grab your flag!

Some programs impose specific constraints on their inputs. Keep building your knowledge of the cIMG format, but be aware of the restrictions that /challenge/cimg places on the additional data you have to send in this level.

It is time to look at your first cIMG! Make the program display an image with the correct properties, and it will give you the flag.

It's time to upgrade to a new version of the cIMG, getting much closer to usurping the boring old image formats of the web. Spot what's different, understand what /challenge/cimg wants, and get the flag!

Programs keep extensive internal state about the actions they have taken or should take in the future. This version of /challenge/cimg tracks much more state than before, which allows it to impose strict requirements on your input before it gives you the flag. Figure out what image it wants, and score!


Approach Suggestion: This challenge processes and displays your input like before, and in the process maintains an expected state to check. To understand what your input should be, you might consider the following approach typical of a Reverse Engineering pipeline:

  1. Understand what the program expects its internal state to be to give you the flag. You should do this using a combination of an graphical reversing tool (IDA, etc) to form hypotheses about what the program is doing to your input and a runtime debugger (gdb) to verify those hypotheses at runtime. For example, at some specific assembly instruction, the program makes a decision about whether or not to give you the flag. Find this point in your graphical reversing tool, then verify your understanding at runtime with gdb. Strive to understand what the data that it is checking agianst means, at least on a high level. A check is typically done between some questionable data (in this case, some transformation of your image) and known-good data. What is the latter in this scenario?
  2. Understand how the program uses your input to generate the "questionable" data that it checks in its decision making process. Programs implement algorithms, and these algorithms can be understood. What's the algorithm that transforms your input into the data used by the program to make decisions? Can you verify this hypothesis with gdb? Can you use this verified knowledge to tweak your input to change the program's behavior just a bit (e.g., get a bit further before the program decides you don't get the flag)?
  3. Many algorithms that transform data from A (in this case, your input data) to B (in this case, the "questionable" data that is compared against the known-good data) can be inverted to create an algorithm that transforms B back to A. Can you use the knowledge derived from step 2 to create such an algorithm? In other words, can you create an algorithm to translate from the internal state of the program back to your input data? First, manually apply it to some data to see if it makes a difference to how the program treats your input.
  4. Now you have an algorithm that can take the program's expected state and produce input data that will cause the program to reach that internal state. Next, you need to extract the "known-good" internal state that the program expects. This can be done in many ways, but the two go-tos are your graphical reversing tool and your debugger (gdb): both can accomplish the task of extracting the expected state to a file.
  5. Apply your algorithm from Step 3 to the extracted known-good state in Step 4 to get an input that should trigger the expected state. Plop the right metadata on it (whatever magic numbers and so on that the program needs) and feed it into the program!

Now we come to the final change that we will make to the high-level design of cIMG. File formats tend to give the parsing program a number of directives to carry out different actions. In HTML files, for example, each tag is a directive that causes the rendering engine to take action. Image files are similar. We'll start our exploration of this relationship between the format and the parser by extending cIMG with a similar concept.

Support for different directives allow fairly advanced cIMG functionality. For example, you previously created the images you needed by specifying all of the pixels explicitly. However, with the advanced functionality added in this level, that is no longer necessary! Consequently, the challenge will not give you the flag if you use too many bytes to make the right image. Good luck.


Approach Suggestion: This level will require you to create a cimage with multiple directives in one file. Some hopefully-useful suggestions:

  • Your first several attempts at this will likely result in an error message. Do not simply guess at how to fix this error message! Instead, use a combination of a graphical reversing tool and a debugger (gdb) to actually understand which check is failing, and adjust your input to avoid failing this check.
  • Writing your cimage by hand will be very error-prone. Consider creating a Python script to generate cimages. Your script should somewhat mirror the parser library, with a function to generate the cimage header and functions for every directive that you want to support. This will make your life MUCH easier in this and future levels.

How optimal was your solution to the last level? If you truly understand the subtleties of cIMG specification, you can optimize things even further...


NOTE: This challenge is a side quest. The knowledge you get from this challenge is not necessary to proceed further; it's just more practice and a better grade! If you are stuck here, feel free to skip it and come back later.

That last level was pretty tricky. To make it up to you, we'll give you the flag in this level ... as a cIMG!


HINT: There are a few ways to solve this level. We recommend using a hex editor (such as hexedit) to edit your image!


HINT: The challenge uses the figlet program to generate the flag letters. If you are having trouble discerning a particular letter, run figlet -fascii9 pwn.college{XYZ} and compare the output to test your hypotheses.

Patches were cool, but sprites are better. Can you optimize the image size even more?

How well do you grasp the cIMG format? This is a chance to show yourself just how much you've learned!

This level's /challenge/cimg has no way to give you the flag, but we'll give you a cimg file containing it!

This level explores trade-offs between adding just a bit of complexity to a software feature (in this case, the cIMG sprite functionality) and its resulting functionality improvement (making the cIMG file smaller!). We might be getting close to optimal cIMG sizes here, and /challenge/cimg will be very demanding!

Often times, as feature bloat makes a software project more and more complicated, vulnerabilities slip in due to the interaction of too many moving parts. In the course of reverse engineering the software, reverse engineers will often spot such vulnerabilities.

This is one such scenario. Find and use the vulnerability in /challenge/cimg to get the flag!

The extensibility of our cIMG directives list continues to amaze!


HINT: Keep the last level of Linux Luminarium in mind. It will be useful.

Okay, the previous level was embarrassing, but those sorts of bugs happen all the time! We fixed that issue in this level and, because we feel somewhat contrite about the bug, we're giving you the flag again... in an animated cIMG! Good luck.

Another goal of reverse engineering is for interoperability. Time is a harsh mistress, and many things can happen to or around software as time goes on. For example, source code can get lost, as it is typically secretive and internal to a company in question, but binary files tend to persist, as they are widely (e.g., commercially) distributed. When source code is lost, interoperability must be achieved by reverse engineering the binary file to either reconstruct or adapt them.

In this challenge, we've recovered the source code of the game The Epic Quest for the Flag (/challenge/quest.py), but its customized cIMG-based graphics engine is missing! We've provided you a close approximation (a normal cIMG parser in /challenge/cimg), but it does not work out of the box. You'll need to binary-patch it for compatibility! Can you uncover the flag?


NOTE: Note, /challenge/quest uses cimg as a graphics engine, but it's built for a custom version of cimg that you do not have. You run it with /challenge/quest NOFLAG | /challenge/cimg to run it in "compatibility mode": no flag, but compatible with the standard cimg. If you want the flag, you'll need to modify the cimg to work with the quest, and run it via /challenge/quest | /home/hacker/your-patched-cimg.


HINT: You can either patch using your reverse engineering tool of choice or figure out the file address of the bytes you want to patch and use hexedit.

The previous level was lucky --- no actual code had to be patched. This level will be your first foray into binary patching at the code level, but the only things you'll need to patch will be simple constants used by x86 comparison instructions. Good luck!

This level is trickier: you'll need to patch slightly trickier instructions. This gets you much closer to real patching for interoperability!


30-Day Scoreboard:

This scoreboard reflects solves for challenges in this module after the module launched in this dojo.

Rank Hacker Badges Score