Investigating a firmware image and its compression

Downloading a firmware image and opening it up for the first time can be a great motivator - a list of bytes that somewhere hides executable code, just waiting to be analysed. Maybe you'll be the first person to have done so. Maybe you're the thousandth but you're up for a challenge.

I'm sure there's two types of people - those who lose all hope when binwalk doesn't spit out a single useful signature match, and those whose motivation to uncover the hidden secrets skyrockets. I'm not sure where I stand; it takes me a couple hours staring aimlessly at a hex editor, head-in-hands, for the motivation to completely fizzle.

So with a firmware image acquired, I turned to binwalk and xxd for some light. Light that neither tool decided to shed onto the structure of the image.

Binwalk showing a false signature for a "MySQL ISAM index file" and a single copyright string. XXD is run on the file, which shows no discernible signature or structure to the file.

That's alright - it doesn't take much to fool binwalk into not displaying any matches. Let's employ another fun trick: converting our binary file into a PGM image (thanks scanlime) to get an idea of how the file is structured. The 8600E134.dat file is 10,383,159 bytes long - just a few bytes more than a 512 pixel wide, 20,000 pixel tall, 1-byte-per-pixel image. That corresponds to a PGM header of P5 512 20000 255\n which we'll prepend to the binary (click here for a bigger, not truncated version).

The PGM file shows a small section of noise/data, followed by a short section of white, a short section of noise/data, a very large section of noise/data, a large section of white, and a short section of noise/data.

Looks like our file is actually a number of separate images, separated by large swaths of white 0xFF bytes. Not much use to us just yet.

Can strings give us a lead? If the file was simply a raw flash image that the processor loads and runs from 0, we'd expect to see the application's strings just sitting around in the binary. In this case since our binary starts with a very ASCII-looking string, there might be a header in the way, but that won't change the presence of raw strings further on in the file.

Unsurprisingly, running strings 8600E134.dat > 8600E134.dat.strings produces a 186,938 line file. Opening the listing in a text editor reveals something interesting however: not a single string of decent length was in one piece. Did we break strings, or is something else going on?

The strings file is open in Visual Studio Code, where the following lines (representing separate strings) are shown: "Updatin", "g the fi", "rmware i", "s very r", "isky. If". This pattern of broken strings continues.

In a hex editor, it's clear that strings knew what it was doing. All of these strings are... odd. We've got enough of the string to work out what it once was, and the manual shows us what the original is. My "revelation" here may have already been spoiled for you, but see if you can spot what I eventually did.

The raw firmware image is open in a hex editor, where the string snippets previously seen are visible. They are not null-terminated, and are broken up by single bytes and on occasion much longer sequences of bytes with unclear purpose.

The device's firmware update screen, which reads "Updating the firmware is very risky. If you make a mistake, the IC-R8600 may not operate properly, and repair at..." which continues for many more lines.

For a while, the entirely missing segments of the string baffled me. Where's the end of "mistake"? Where did the "ake" go? But with time, the revelation came - not more than 10 characters earlier did the string "ake" also occur. The "oper" of "properly"? Already there, at the start of "operate".

We've got some kind of compression here.

Compression

Compression formats all operate on a shared principle - during compression, repetitive data can be replaced with some kind of marker which signals to the decompressor to stop, write out a sequence of bytes from some known source, and continue. Non-repetitive, or high entropy data can simply be stored as-is: no space savings can be found by replacing it with marker values, as at length it's probably unique within the file.

Some compression formats utilise an explicit dictionary to perform this task - either running through the file before compression begins to find an optimal list of such repetitive byte sequences, or by progressively building a dictionary as compression is performed. In either case, the algorithm's marker values are simply a command to the decompressor to "grab the dictionary string at index 0x13," (for example) "and write it to the output".

Markedly different are compression algorithms which utilise a sliding window. Rather than creating an explicit dictionary which maps indices to byte strings, marker values encode an offset from the current file position and a length: when a marker is encountered while decompressing, the decompressor seeks backwards by the offset value and copies the specified number of bytes from the already decompressed data into the output file.

The missing repetitive values we see in our data are short, and when data is missing the block to be repeated occurs very close to the gap. Sounds like a sliding window algorithm, right? An explicit dictionary format probably wouldn't find adding a string as short as "ake" or "oper" to its dictionary as a smart (i.e. file-size reducing) move.

Let's run with this hunch.

Our format

One thing you may have noticed in the hex view of the image was that every string of 8 literal characters (i.e. not compressed, and should be written directly to the decompressed file) was prefixed by an 0xFF byte.

# ^ tons of bytes removed
7F 6F 7F 81 7F 93 7F FE 9C 75 55 70 64 61 74 69 6E # not sure what's going on here...
FF 67 20 74 68 65 20 66 69 # "g the fi"
FF 72 6D 77 61 72 65 20 69 # "rmware i"
FF 73 20 76 65 72 79 20 72 # "s very r"
FF 69 73 6B 79 2E 20 49 66 # "isky. If"
0F 20 79 6F 75 9C 7F EA 7F FC 7F 4B 94 FF 06 6D 61 6B 65 # ...or here
# v tons of bytes removed

We can't just go searching through the file for 0xFF bytes, since when we don't know the entire plaintext we can't be 100% sure that the 0xFF byte we found isn't actually meant to end up in the plaintext. We can be 99.99% sure that in this case, our 0xFF byte is actually some kind of "8 literal bytes follow" marker, because we know the plaintext and no other interpretation would really make sense. We almost fall "out of sync" as soon as we hit a non-0xFF byte after a 0xFF [8 literal byte] sequence, since we can't be sure what that next byte tells the decompressor to do. If we continue to go at this alone, without many, many hours of reverse engineering, we've kinda hit a dead end.

Final Fantasy VII, aka great LZSS documentation

I can't tell you how exactly I worked out what compression algorithm the firmware image used. To be honest, I think I stumbled upon some documentation and the Wikipedia article much earlier in the RE effort than I remember (after searching something like "0xFF 8 literal bytes"), but I wrote it off pretty quickly for some reason. I think it was partially due to misinterpreting the docs, but that's in the past now. Who knows.

Turns out our compression is LZSS, and our win comes in the form of documentation for the Final Fantasy VII data file format. Combined with Wikipedia's outline of LZ77 (which utilises similar principles, and was helpful in understanding how the LZSS format operates), we've got enough to write our own decompressor.

I'll "summarise" the LZSS format here.

An LZSS compressed file is comprised of a header (4 bytes - simply the decompressed length of the file stored in little endian), followed by a series of blocks.

Blocks

Each block begins with a control byte, which dictates how the bytes which follow the control byte are to be interpreted. A control byte is always followed by 8 pieces of data (note: not necessarily 8 bytes), which may each be either a literal byte (1 byte long), or a reference (2 bytes long).

Each bit of the control byte defines what pieces of data are present after the control byte, with a 0 bit indicating a reference and a 1 bit indicating a literal. Bit 0, the least-significant bit, defines the piece of data which immediately follows the control byte. Bit 7, the most-significant bit, defines the piece of data which immediately precedes the next control byte.

As each byte of the decompressed file is calculated, it is written to a circular buffer of length 4096 bytes. The first byte of the decompressed file is placed at index 0xFEE of the buffer, and the index incremented by one byte. Subsequent bytes are written at the current index, the index incremented, and the index wrapped around to 0 if it exceeds 4095 (the last position in the buffer).

The buffer is initialised to all zeros.

Control byte Data
0xFF = 0b1111_1111 8 literals
0xFE = 0b1111_1110 1 reference, followed by 7 literals
0x7F = 0b0111_1111 7 literals, followed by 1 reference
0xE3 = 0b1110_0011 2 literals, followed by 3 references, followed by 3 literals

Literals

Literals are one byte in length, and as the name suggests - should be copied into the output file untouched.

References

A reference is two bytes in length, and encodes both a length and offset. The offset is not relative to the current index, i.e. an index of 0x210 is 0x210 bytes from index 0 of the circular buffer. When encountered, the requested number of bytes should be copied from the circular buffer and into the output file, beginning at the requested offset into the buffer.

Byte p p+1
Bit 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0
Value Offset (low byte) Offset (high nibble) Length (stored as length - 3)

Note: offset = [high nibble] << 8 | [low byte].

The length of a reference is stored minus 3 as any reference which pulls in less than 3 bytes would take up less space if written as a pair of one or two literals. The data referenced by a length-offset pair can include data which has not yet been written to the file (i.e. offset 2, length 4) as - operating one byte at a time - the bytes which appeared to not yet exist are written before they are needed (Wikipedia explains this better).

Decompression

Before we can think about decompression, we must isolate the compressed block from the firmware. Recalling the 0xFF-delimited blocky structure of the image we saw earlier, if we work backwards from where we found our strings above, we can find the first couple non-0xFF bytes of the block. They're more than likely the start of the block.

The firmware image is open in a hex editor. 0xFF bytes are constant, until offset 0x10040, where actual data and a semi-ASCII string is visible.

The end of the block isn't as easy - we've got to be mindful that we're not 100% sure where the block ends (since the decompressed section could be end-padded with 0xFF which would look just like our filler data!) and so it's safer that we just chop the extra front-matter off instead. Our compression format stores the decompressed length, so we're unlikely to accidentally start attempting to decompress data that isn't actually compressed.

We'll copy from the offset 0x10040, where the compressed data starts, to the end of the file. That becomes compressed.bin.

With all this information about the format and our compressed data, we can implement a quick Python script to do our decompression.

The Python decompression script is run on compressed.bin. XXD is then run on the decompressed output, showing the first 10 lines. A human readable string and patterns are noticeable in the output.

We have some known plaintext, so let's go looking:

The strings command is run on the decompressed bin file and piped to grep which searches for the full string "Updating". The output shows the original strings from the firmware update dialog, and more.

Much more like it :)

A retrospective

After I'd continued the RE process it turns out that this firmware image is (as suspected) simply written into on-board flash and executed from 0 (skipping 0x40 bytes of header). The firmware's first step is to load a second stage from flash into RAM, which then loads the actual application into RAM by decompressing it with LZSS 🤦‍♀️

My first step could have been doing the research to find that the device runs on an ARM Cortex-A9, loading the image into a disassembler, seeing some plausible code and following the loading stages... but then, instead of trying to find a mystery compression algorithm I'd probably be staring at a bunch of machine code, trying to reverse engineer an algorithm I'd have assumed was proprietary or just obfuscation. It's easy to say in hindsight that this would have been a better route and "I totally would have realised it was a common algorithm!" but I don't think that's true. The only reason I was able to locate the decompression routine in the second stage was because I knew to look for the 0xFEE constant - information I wouldn't have known if I'd started from code.

Lessons learned, if nothing else.

You might also be interested in:

See all posts