Huffman coding

Richard Suchenwirth 2002-04-11 - In Huffman coding, characters (or other data items) are represented as bit sequences of varying length, so that the most frequent items have the shortest bit sequences. This way, storage requirement is reduced compared to fixed-length bit sequences, if the frequency distribution is appropriate for the input data.

Encoding Huffman data is dirt simple with string map. Decoding could have been implemented by a decision tree of 25 nested ifs (given the overly simple 26 character alphabet that I use - no digits or special chars, not even blank can be represented), but I expect such a construct to be badly maintainable, and it has the disadvantage that the data is hidden in the program structure. Therefore I decided for the slightly wasteful approach of using regexp to determine and chop off matching prefixes (this is possible because Huffman trees match the Fano condition that no code is the prefix of another code). It is still not noticably slow for short test words (150..200 msec on my P200 box, for ten-letter HELLOWORLD), and you can replace the default map by one that better fits your purposes, even at call time.

Finally, the resulting bit strings are of course longer than the input, as every bit uses one character (this is an educational toy after all). For practical use, one could use binary format to turn bit strings to unreadable byte sequences, or map sequences of six bits to one printable character. Such strings would be both (simply) encrypted and loss-free compressed (trailing zeros added in the final conversion would be ignored on decoding, because no code consists of all zeros - I had to change Y from 00000 in Huffman's original map to 000001 to obtain that added safety).

 proc huffman {mode data {map {}}} {
    if {$map==""} {
        set map {
          E 100 T 001 A 1111 O 1110 N 1100 R 1011 I 1010 S 0110 H 0101
          D 11011 L 01111 F 01001 C 01000 M 00011 U 00010 G 00001
          Y 000001 P 110101 W 011101 B 011100 V 1101001 K 110100011
          X 110100001 J 110100000 Q 1101000101 Z 1101000100
        }
    }
    switch -- $mode {
        e - encode {
            regsub -all {[^A-Z]} [string toupper $data] "" data
            string map $map $data
        }
        d - decode {
            set res ""
            while {[string length $data]} {
                foreach {out in} $map {
                    if {[regexp ^$in\(.*) $data -> data]} {
                        append res $out
                        break
                    }
                }
            }
            set res
        }
        default {error "usage: huffman (encode|decode) data"}
    }
 }

The above example map would render text which is more elaborate than just one uppercase word pretty unreadable. For practical application, one could analyze an object domain (a set of texts, for instance): count single character occurence frequency, but also n-grams (sequences of two or more characters); construct a Huffman map that fits best to the measured distribution. If long n-grams come before their shorter substrings (e.g. the before th before t or e), encoding would still work with n-grams (decoding always works as long is the Fano condition is met), and considerable compression effects can be expected. -- The break in the decode part brought time from 6.5 ms down to 3.4 ms for en+decoding a 21 letter word.

Eric Melski

You can get significantly better performance by using string map for decoding as well as encoding. This works again because none of the encoded values is a prefix of any other value. This gives a roughly 300x speed improvement on a long string of random characters:

 proc huffman2 {mode data {encodeMap {}}} {
    if {$encodeMap==""} {
        set encodeMap {
          E 100 T 001 A 1111 O 1110 N 1100 R 1011 I 1010 S 0110 H 0101
          D 11011 L 01111 F 01001 C 01000 M 00011 U 00010 G 00001
          Y 000001 P 110101 W 011101 B 011100 V 1101001 K 110100011
          X 110100001 J 110100000 Q 1101000101 Z 1101000100
        }
    }
    # invert the encode map to make the decode map
    set decodeMap {}
    foreach {out in} $encodeMap {
        lappend decodeMap $in $out
    }
    switch -- $mode {
        e - encode {
            regsub -all {[^A-Z]} [string toupper $data] "" data
            string map $encodeMap $data
        }
        d - decode {
            string map $decodeMap $data
        }
        default {error "usage: huffman2 (encode|decode) data"}
    }
 }

RS: Indeed... I was afraid that e.g. 100=>E would also hit on 01001=>F, because "100" is a substring there, but your code works just perfect... Thank you for this lesson in string maps power ;-)

Continued in Huffman coding, part 2