proc huffman {mode data {map {}}} { if {$map==""} { set map { E 100 T 001 A 1111 O 1110 N 1100 R 1011 I 1010 S 0110 H 0101 D 11011 L 01111 F 01001 C 01000 M 00011 U 00010 G 00001 Y 000001 P 110101 W 011101 B 011100 V 1101001 K 110100011 X 110100001 J 110100000 Q 1101000101 Z 1101000100 } } switch -- $mode { e - encode { regsub -all {[^A-Z]} [string toupper $data] "" data string map $map $data } d - decode { set res "" while {[string length $data]} { foreach {out in} $map { if {[regexp ^$in\(.*) $data -> data]} { append res $out break } } } set res } default {error "usage: huffman (encode|decode) data"} } }The above example map would render text which is more elaborate than just one uppercase word pretty unreadable. For practical application, one could analyze an object domain (a set of texts, for instance): count single character occurence frequency, but also n-grams (sequences of two or more characters); construct a Huffman map that fits best to the measured distribution. If long n-grams come before their shorter substrings (e.g. the before th before t or e), encoding would still work with n-grams (decoding always works as long is the Fano condition is met), and considerable compression effects can be expected. -- The break in the decode part brought time from 6.5 ms down to 3.4 ms for en+decoding a 21 letter word.
Eric MelskiYou can get significantly better performance by using string map for decoding as well as encoding. This works again because none of the encoded values is a prefix of any other value. This gives a roughly 300x speed improvement on a long string of random characters:
proc huffman2 {mode data {encodeMap {}}} { if {$encodeMap==""} { set encodeMap { E 100 T 001 A 1111 O 1110 N 1100 R 1011 I 1010 S 0110 H 0101 D 11011 L 01111 F 01001 C 01000 M 00011 U 00010 G 00001 Y 000001 P 110101 W 011101 B 011100 V 1101001 K 110100011 X 110100001 J 110100000 Q 1101000101 Z 1101000100 } } # invert the encode map to make the decode map set decodeMap {} foreach {out in} $encodeMap { lappend decodeMap $in $out } switch -- $mode { e - encode { regsub -all {[^A-Z]} [string toupper $data] "" data string map $encodeMap $data } d - decode { string map $decodeMap $data } default {error "usage: huffman2 (encode|decode) data"} } }
RS: Indeed... I was afraid that e.g. 100=>E would also hit on 01001=>F, because "100" is a substring there, but your code works just perfect... Thank you for this lesson in string maps power ;-)
Continued in Huffman coding, part 2See also Adaptive Huffman Coding