Problem Set 4 Solution



Problem 4. (Make it Smaller)

The goal of this problem is compression: take a le and make it smaller. We will take every byte of the input le and re-encode it so that the entire le is smaller.

By default, each ASCII character is stored using 8-bits. But this is quite ine cient: some characters (like ‘e’) appear much more frequently. Others appear much less commonly. We could save space by encoding an ‘e’ using only 3 bits, while using 12 bits to store every ‘z’ in the document. To be more precise, imagine a document has W characters in it, and character ci appears wi times in the document. Then the ideal thing to do is store character ci using log(W=wi) bits.1

In this problem, we will use a binary tree to generate the codewords.

Each leaf in the tree will represent a symbol in the input le. The codeword is found by looking at the path from the root to the leaf containing the symbol: start at the root, and every time you go left, add a ‘0’ to the codeword; every time you right, add a ‘1’ to the codeword. When you reach the leaf, return the codeword itself.

A key property of this type of code is that they are pre x-free codes: if c is the codeword for some symbol, then c is not a pre x for any other symbol (e.g. f00; 01g is pre x-free but f00; 001g is not pre x-free because 00 is a codeword and a pre x of 001). This is very useful because it means that

1The reason why that is optimal is related to the idea of entropy.


if we ever see the codeword c, we know exactly how to decode it (without worrying that it is part of some longer codeword).

In this problem, we will provide you with most of the mechanism for encoding and decoding les. The only part you have to do is the part related to tree manipulation: build the tree, and implement two query methods: one that translates symbols to codewords, and one that translates codewords to symbols.

There are two versions of this problem: the easy version and the harder version. You are required to do one of them (and a small bonus for doing the harder one).

In the easier version, you can ignore the frequency with which each symbol appears. In this case, you only have to build a balanced tree. However, the compression performance will not be as good, since you are not getting any bene t from prioritizing repeated characters.

In the harder version, you will weight the tree according to the frequency that each items shows up. This will get you close to an optimal encoding.

The easier version should be implemented in the UniformTree class, while the harder version should be implemented in the WeightedTree class.

Problem 4.a. Your rst task is to implement the buildTree routine. In both cases the goal is to build a tree where all the symbols are stored at the leaves (i.e., the internal nodes do not contain any symbols, just guides for searching). The tree should be in BST-order.

It is important that the symbols are at the leaves, since that ensures that no codeword is a pre x of another codeword!

Easier version. The buildTree method is called in the constructor. An array of bytes is passed in, each representing a unique symbol in the input le. When the buildTree routine nishes, the member variable root should be the root of a balanced binary tree. The tree should be in BST-order (with respect to the symbols) so that we can search the tree for a symbol e ciently. We have included a TreeNode class as part of the UniformNode class; you may want to look at that. (You should not need to modify it.)

You can assume that there is 2n unique characters, making it possible to build a full binary search tree.

Try to make your tree construction algorithm as e cient as possible, e.g., linear time.

Harder version. In this case, the buildTree method takes as input an array of pairs: the symbols and their weight (i.e., the number of times it appears in the input document). Again, you want to build a tree that is in BST-order, so that it can be e ciently searched. The Pair type supports the Comparable interface, so you can sort it using system sort (java.utils.Arrays.sort) and it will sort by the symbol.


However, now the tree should be weight-balanced rather than simply balanced: the sum of the weight on the left side should be about equal to the sum of the weight on the right side.

Of course, exact balance cannot always be achieved because weights are discrete. Thus the goal is as follows: if W is the weight of some tree node v, ensure that neither the weight of v’s left child nor that of v’s right child exceeds 2W=3.

There is one exception to this rule: if there is one item x that is of size at least W=3 all by itself, then the side containing that item may have more than 2W=3 weight. However, at the very next level of the tree, x should become a leaf.

In the above example, the node on the right is a weight ratio that is more than 2W=3, and is a intermediate node, so it is not allowed. However, in the second example, although the weight is 0:1 and 0:4, this is acceptable because the 0:4 is a single node on its own.

If you do the construction properly, an item with weight wi should be at a leaf of depth O(log(W=wi)), (i.e., an optimal depth), where W is the total weight of the tree,

Try to make your tree construction algorithm as e cient as possible, e.g., O(n log n) time.

Problem 4.b. Your second task is to implement a query to nd a codeword. The queryCode method takes a key as an input (in the form of a byte) and returns the codeword, in the form of a boolean array. (We will think of the codeword as a boolean array because we will write it to the le as a sequence of bits.)

You should be able to implement this by walking the tree from the root to the leaf of the appropriate symbol. If the symbol is not found at a leaf, then return null.

For the codeword returned, if code is the boolean array, then code[0] should be the rst bit of the codeword (i.e., the step taken from the root), while code[1] should be the second bit of the codeword, etc.

Implement the query in an e cient manner.


Problem 4.c. Your third task is to implement a query to nd a symbol, given a codeword. The query method takes a boolean array code and an integer bits as an input. The method will then output a key (in the form of a byte). The code array is only valid in the range from code[0..bits-1]. (The rest of the array is declared in advance to act as a bu er, since we do not know how big the codeword will be in advance.)

The goal, then, is to lookup the codeword code[0..bits-1] in the tree and if the result is a leaf, return the symbol found. Otherwise, return null.

For the codeword, if code is the boolean array, then code[0] should be the rst bit of the codeword (i.e., the step taken from the root), while code[1] should be the second bit of the codeword, etc.

Implement the query in an e cient manner.

Once you have implemented the above three methods, the compressor and decompressor classes should work. Simply update the input and output les in the main routine of each of them and you should be able to compress and decompress accordingly. How good a compression rate do you get? (For text les with 8-bit ASCII text, I would expect you can save at least 3 bits per character.)

Problem 4.d. (Optional.) You can get better compression by being a little more clever. There are several things you might try. A natural thing to try is to build a Hu man Tree. How much better performance do you get? Another idea is that you might do better in text documents if you looked at words instead of characters: what if you map each word in the text to a leaf in the tree? Do you get better compression? What is the best compression you can get for, say, Hamlet?