Compression Crash Course
Compression Crash Course by Pierre Roux
File compression is this magic that computers do in order to decrease the size of a file. It seems simple enough, drag a file of your choice into WinZip and see it take up less space than before. But wait a minute – why aren't all files just compressed from the start then?
When we open up a .zip file with trusty old Notepad, we see some text looking like this "...›É§4A ó¦q²lûÝ §...". This gibberish is the reason why all files aren't compressed in the first place. But what we need to understand, is that this gibberish is really just a bad representation of the underlying data.
Computers have no comprehension of letters, or digits or even numbers for that matter. Every letter you are reading here, is actually just a numerical value that we interpret with numbers between 0 and 255. The computer will check this number against the ASCII table, which tells it that the value 65 corresponds to the capital letter "A". This is partly the reason why compressed files seem to be gibberish, we were misinterpreting those values!
To understand what is wrong with our interpretation, we need to look at a basic compression technique. The alphabet contains 26 characters, and on the ASCII table it starts at value 65. This means that every letter's value is equal to its position in the alphabet, plus 64. "A" is 1st in the alphabet, plus 64 gives us 65, which is the ASCII value of "A". It is so much easier writing them in values of 1 to 26 instead, so let's do that and just add a note saying "plus 64" in the file. We now have at least 9 letters that can be written with single digit values. This is how we save on space.
|Portion of ASCII table|
Let’s look at an example. The word CABBAGE will be printed out in both raw and compressed ASCII values. It looks like this:
|Compressed versus raw data|
|Cabbage||67, 65, 66, 66, 65, 71, 69||Raw ASCII values|
|3, 1, 2, 2, 1, 7, 5||Compressed ASCII values|
It is important to note that this difference in size becomes magnified when we look at the binary system for these letters. To display any ASCII value, you need 8-bits in binary (this is 255 in decimal), but in order to display a limited set of 32 letters, you only need 5-bits in binary. Imagine you were counting from 1 to 10 but saying aloud the placeholder zeroes every time.
So, let’s just add the little note at the end of the file, saying that to use these values one must add “64” to them. But if an unaware program attempted to read this file of ours, it would not understand our note. Instead, the program would read our values of 1 to 26 and check them against the ASCII table, where the symbols for the values of 1 to 26 look like gibberish. Luckily, we have implemented standards to stop it from looking like this:
This is in essence the foundation of compression, to write the same details and information, using less space. Now that we know compression isn’t magic, how can we make this better?
So, let’s just add the little note at the end of the file, saying that to use these values one must add “64” to them. But if an unaware program attempted to read this file of ours, it would not understand our note.
Instead of using the entire ASCII table, we could design our own table, only for the file we are compressing. This sounds crazy – a new mapping table for every file would waste precious space, right? It turns out that in the bigger picture, the table is really only just a tiny portion of the final result. Let’s look at some example text and attempt to compress it with a special table.
Computer science is no more about computers than astronomy is about telescopes. - Edsger Dijkstra
Note how we only use 16 different letters (a, b, c, e, h, i, l, m, n, o, p, r, s, t, u, y) for this example. Now we can use a 4-bit binary value to define each letter, but to keep it simple we will talk in decimals instead. The compression factor here is already 50% when we compare our 4-bit version versus the 8-bit ASCII standard.
Now we will use this table above to encode (to compress) the example text from the Dijkstra quote. Simply put, we look at each letter to find its corresponding number value on the mapping table and write that down instead of the letter. When we want to decode (to decompress) this result, we simply do the reverse: Find the number that matches the encoded value on the mapping table and choose the letter on that table to write down instead. We will remove some complexity by forcing our example text into a lowercase character set, this is because computers see capital letters as different letters since their ASCII values are not the same as the lowercase characters.
|Example text encoded using different tables (Spaces replaced with underscores)|
|Standard ASCII||4-bit example|
|99 111 109 112 117 116 101 114 _ 115 99 105 101 110 99 101 _ 105 115 _ 110 111 _ 109 111 114 101 _ 97 98 111 117 116 _ 99 111 109 112 117 116 101 114 115 _ 116 104 97 110 _ 97 115 116 114 111 110 111 109 121 _ 105 115 _ 97 98 111 117 116 _ 116 101 108 101 115 99 111 112 101 115||2 9 7 10 14 13 3 11 _ 12 2 5 3 8 2 3 _ 5 12 _ 8 9 _ 7 9 11 3 _ 0 1 9 14 13 _ 2 9 7 10 14 13 3 11 12 _ 13 4 0 8 _ 0 12 13 11 9 8 9 7 15 _ 5 12 _ 0 1 9 14 13 _ 13 3 6 3 12 2 9 10 3 12|
In the previous results it is easy to see that some letters are used more often than others. The letter “o” is used 9 times, but the letters “h”, “y” and “l” are only used once each. If we can determine which letters occur most frequently, we can design the encoding table in such a way that those letters have smaller values on the table (which correlates to shorter binary bit values). Since we are just mapping letters to unique identification codes at this point and they have no value number associated to them anymore. We will simplify this and use the term “code” instead of “value”.
To design such a system, we first need a “frequency table”. This contains the count of how many times each letter was used in the example. We will build the codes from the frequency table.
Generating these codes will require building a Huffman tree, which is really just a pairing of these letters into little groups. We start with the 2 letters with the lowest occurrence frequency, “l” and “y”. Label them with digits 0 and 1, then replace their original entries in the frequency table with this paired entry. This pair will now be placed at the bottom of the Huffman tree.
Now we will look at the next least frequent letters. They are “h”, and the new pair of “l” and” y” with a combined frequency of 2, which we will use just like any other letter. Again, label these with binary digits, replace their original entries in the frequency table with this new pairing. Now build one more node in the Huffman tree.
We can continue on like this, placing new nodes on the tree as letters are paired up, but eventually we will pair up the last two entries. This means we have built the entire Huffman tree, that in itself is already the final encoding table we set out to design.
We read this table from the top, following the arrows and remembering the binary digits as we pass them on our way to a specific letter of our choice. The “o” letter was used most often, so looking at its unique code we start on the tree, we follow the arrows downward to “o” and record its code as being “110”
If we do this for all letters, we can transform this information into an easy to use encoding table. Important facts to note are that the more commonly used letters (o, e, s, t) have far shorter codes than less often used letters. This way we get a really good compression factor going for most of the text in our example, at the cost of using long 7-bit codes merely a few times.
Now we can encode our example text with these binary codes, add this table to the file and send it to our friend on a faraway network. They now have all they need to decode this data into the original meaning. To do this, they simply start reading the coded string we sent them, one digit at a time. We know the first letter in the example is the letter “c”, but they don’t know that. When they read the first 3 digits (101…), they will realise there is no match in the table thus far. But when they read the fourth digit, they will find a match on 1010 which is the code for “c”.
They can do this for the rest of the message until everything is decoded into a human readable format. Keep in mind that this encoding table is mostly unique to our example text: You can build slightly different Huffman trees depending on how you sort the frequency table. “a”, “m”, “n”, “r” and “u” all had equal frequencies of 4, and I arbitrarily sorted them alphabetically but there is no reason why you cannot sort them differently. This is an excellent reason to send the final encoding table which is less ambiguous than the Huffman tree.
|Raw Binary||Encoded with Huffman coding|
|1 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0
|Computer science is no more about computers than astronomy is about telescopes|