- Data: abcdefghijAabcdefBCDdefEFG
- Output: abcdefghijA{offset:11, length:6}BCD{offset:6, length:3}EFG
- Data: abcdef bcde bcde bcde bcde 123
- Output: abcdef {offset:6, length:20}123
- Matches are encoded by an escape character, then the length, then the offset
- If both the length and the offset are <= 255, then the escape char is \x01. The length and offset are both encoded in one char * If either the length or offset are > 255, then the escape char is \x02. The length and offset are both encoded in two chars (one short)
- If an escape char needs to be outputed, it followed by \x01 to indicate that it is a literal character, not an escape (ie, to escape it)
- For short matches (\x01 escapes) the match must be at least 3 characters long)
- For long matches (\x02 escapes) the match must be at least 5 characters long)
# ############################################# # FILE: lz77.tcl - the code for the compression namespace eval lz77 { variable lookback 255 variable Escape1 "\x01" variable Escape2 "\x02" variable EscEsc "\x01" set K32 [expr {32 * 1024}] } proc ::lz77::K {a b} { return $a } proc ::lz77::encode {data} { variable Escape1 variable Escape2 variable EscEsc variable K32 set output "" set dataLength [string length $data] append output [string index $data 0] for {set i 1} {$i < $dataLength} {incr i} { #puts -nonewline "\n$i: " set foundAt 0 set foundLen 0 # Get the first 32k bytes of the current string set found [maxSubstring $i $data length offset] if { $found && $length > 0 } { if { ($offset < 255) && ($foundLen < 255) } { set offChar [format %c $offset] set lenChar [format %c $length] append output $Escape1$lenChar$offChar incr i $length incr i -1 } else { append output $Escape2 append output [binary format S $length] append output [binary format S $offset] incr i $length incr i -1 } } else { set char [string index $data $i] if { [string equal $char "$Escape1"] } { append output "$char$EscEsc" } elseif { [string equal $char "$Escape2"] } { append output "$char\x00$EscEsc" } else { append output $char } } } return $output } proc ::lz77::decode {data {original ""}} { variable Escape1 variable Escape2 variable EscEsc set output "" for {set i 0} {$i < [string length $data]} {incr i} { set char [string index $data $i] if { ($char ne "$Escape1") && ($char ne "$Escape2") } { append output $char } elseif { ($char eq "$Escape1") && [string index $data [expr {$i + 1}]] eq "$EscEsc" } { append output $char incr i } elseif { ($char eq "$Escape2") && ([string index $data [expr {$i + 1}]] eq "\x00") && ([string index $data [expr {$i + 2}]] eq "\x01") } { append output $char incr i 2 } else { if { $char eq $Escape1 } { scan [string index $data [incr i]] %c length scan [string index $data [incr i]] %c offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} { set char [string index $output $index] append output $char incr index } } else { binary scan [string range $data [incr i] [incr i]] S length binary scan [string range $data [incr i] [incr i]] S offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} { set char [string index $output $index] append output $char incr index } } } } return $output } proc ::lz77::maxSubstring {index data &matchLength &matchOffset} { upvar ${&matchLength} returnLength upvar ${&matchOffset} returnOffset variable K32 set matchLength 0 set matchOffset 0 if { $index > $K32 } { set data [string range $data [expr {$index - $K32}] end] set index $K32 } set upperBound [expr {1 + [string length $data] - $index}] if { $upperBound >= $K32 } { set upperBound $K32 } set lowerBound 0 while { ($upperBound > 3) && (($upperBound - $lowerBound) > 1) } { set currentLength [expr {($upperBound + $lowerBound)/2}] if { $currentLength < 3 } { set currentLength 3 } set currString [string range $data $index \ [expr {$index + $currentLength -1}]] set currData [string range $data 0 [expr {$index + $currentLength -2}]] set found 0 if { ([set matchIndex [string last $currString $currData]] >= 0) } { set len [string length $currString] if { ($index - $matchIndex) > 255 && $len < 5 } { if { 5 > $matchLength } { set currString2 [string range $data $index [expr {$index +4}]] set matchIndex2 [string first $currString2 $currData $matchIndex] if { $matchIndex2 >= 0 } { set matchOffset [expr {$index - $matchIndex2}] set matchLength 5 set found 1 } } } else { if { $len > $matchLength } { set matchOffset [expr {$index - $matchIndex}] set matchLength $len set found 1 } } } if { $found } { set lowerBound $currentLength } else { set upperBound $currentLength } } if { $matchLength > 0 } { set returnOffset $matchOffset set returnLength $matchLength return 1 } else { set returnOffset 0 set returnLength 0 return 0 } }And, the tests...
# ############################################# # FILE: lz77.test - tests for lz77 package require tcltest namespace import tcltest::* source [file join [file dirname [info script]] lz77.tcl] # ======================================== test encode-1.1 { encode a simple string in using LZ77 } -body { set encoded [::lz77::encode {abcdebcdef}] set expect "abcde\x01\x04\x04" append expect f if { [string length $expect] != [string length $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "ENCODED: $exList -> $enList" return "String lengths were not the same: \ [string length $expect] != [string length $encoded]" } if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECTED: $exList\nENCODED : $enList" return "Strings were not equal" } } -result {} test encode-1.2 { encode a simple string in using LZ77 } -body { set encoded [::lz77::encode {Blah blah blah blah blah!}] set expect "Blah b\x01\x12\x05!" if { [string length $expect] != [string length $encoded] } { return "String lengths were not the same: \ [string length $expect] != [string length $encoded]" } if { ![string equal $expect $encoded] } { return "Strings were not equal" } } -result {} test encode-1.3 { encode a string with multiple matches } -body { set string {This is a string with multiple strings within it} set expect "This \x01\x03\x03" append expect "a string with multiple\x01\x07\x15" append expect "s\x01\x05\x16" append expect "in it" set encoded [::lz77::encode $string] if { [string length $expect] != [string length $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECT: $exList" puts "ENCODE: $enList" return "String lengths were not the same: \ [string length $expect] != [string length $encoded]" } if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECT: $exList" puts "ENCODE: $enList" return "Strings were not equal" } } -result {} test encode-1.4 { Encode a long (>255 <255*255) string, to use second escape } -setup { set original "abcdefghij" for {set i 0} {$i < 254} {incr i} { append original [expr {$i % 10}] } append original "abcdefg" set expect "abcdefghij0123456789" append expect "\x01\xF4\x0a" append expect "\x02\x00\x07\x01\x08" } -body { set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "EXPECT: $exList" puts "ENCODE: $enList" return "Strings were not equal" } } -result {} # ======================================== test decode-1.1 { decode a simple string using LZ77 } -body { set decoded [::lz77::decode "Blah b\x01\x12\x05!"] set expect {Blah blah blah blah blah!} if { [string length $expect] != [string length $decoded] } { puts "$expect != $decoded" return "String lengths were not the same: \ [string length $expect] != [string length $decoded]" } if { ![string equal $expect $decoded] } { puts "$expect != $decoded" return "Strings were not equal" } } -result {} # ======================================== test cycle-1.1 { cycle a string through encode and decode } -body { set original "This is a string with multiple strings within it" set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded] if { ![string equal $original $changed] } { puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" binary scan $encoded c* enList puts $enList return "The strings were not equal" } return } -result {} test cycle-1.2 { cycle a string through encode and decode } -body { set original "This is a string of text, \ whereherehereherehe parts of the string\ have text that is in other parts of the string" set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded] if { ![string equal $original $changed] } { puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" binary scan $encoded c* enList puts $enList return "The strings were not equal" } return } -result {} test cycle-1.3 { Encode a long (>255 <255*255) string, to use second escape\ and decode it } -setup { set original "abcdefghij" for {set i 0} {$i < 254} {incr i} { append original [expr {$i % 10}] } append original "abcdefg" } -body { set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "ORIGINAL: $orList" puts "DECODED : $deList" return "Strings were not equal" } } -result {} test cycle-1.4 { For a very long string (>32k), remove everything 32k \ past the index } -constraints { slow } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" } append original "abcdef" } -body { set decoded [::lz77::decode [::lz77::encode $original]] string equal $decoded $original } -result {1} test cycle-1.5 { For a very long string (>32k), remove anything more than 32k\ before the index } -constraints { slow } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" } append original "abcdef" } -body { set decoded [::lz77::decode [::lz77::encode $original]] string equal $decoded $original } -result {1} test cycle-1.6 { } -setup { set data "abcdefghij012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123abcdefg" set index 274 } -body { set decoded [::lz77::decode [::lz77::encode $original]] string equal $decoded $original } -result {1} test cycle-1.7 { cycle a string through encode and decode } -setup { set f [open [file join [file dirname [info script]] lz77.tcl]] set original [read $f] close $f } -body { set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded] if { ![string equal $original $changed] } { set oList [split $original \n] set cList [split $changed \n] puts "original has [llength $oList] lines...\ changed has [llength $cList] lines" set line 0 set chars 0 foreach oLine $oList cLine $cList { incr line if { ![string equal $oLine $cLine] } { return "Line $line was difference\ \n\tOriginal: '$oLine'\ \n\tChanged : '$cLine'\ \n\t($chars from preceding lines)" } incr chars [string length $oLine] } #puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" return "The strings were not equal" } return } -result {} test cycle-1.8 { cycle a string through encode and decode } -setup { set original { scan [string index $data [incr i]] %c offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - else - binary scan [string range $data [incr i] [incr i]] S length binary scan [string range $data [incr i] [incr i]] S offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - } } -body { set encoded [::lz77::encode $original] set changed [::lz77::decode $encoded $original] if { ![string equal $original $changed] } { set oList [split $original \n] set cList [split $changed \n] puts "original has [llength $oList] lines...\ changed has [llength $cList] lines" set line 0 foreach oLine $oList cLine $cList { incr line if { ![string equal $oLine $cLine] } { return "Line $line was difference\ \n\tOriginal: '$oLine'\ \n\tChanged : '$cLine'" } } #puts "Not Equal:\nORIGINAL: $original\nENCODED : $changed" return "The strings were not equal" } return } -result {} # ======================================== # Special cases # When there's an escape in the input text # Encode it as \x01\x01, since we can't have a repeat length of 1 test escape-1.1 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x01" ; append original cd set expect "ab\x01\x01" ; append expect cd set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "$exList ->\n$enList" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} test escape-1.2 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x01" ; append original cd set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "$orList ->\n$deList" return "The strings were not equal" } } -result {} test escape-1.3 { An escaped escape should not interfere with runs surrounding it } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "abcdebcde\x01" ; append original cd set expect "abcde\x01\x04\x04\x01\x01" ; append expect cd set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "$exList ->\n$enList" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} test escape-1.4 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x02" ; append original cd set expect "ab\x02\x00\x01" ; append expect cd set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "$exList ->\n$enList" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} test escape-1.5 { An escape in the input data is coded as the escape, followed\ by \x01 } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x02" ; append original cd set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "$orList ->\n$deList" return "The strings were not equal" } } -result {} test escape-1.6 { Both escapes } -setup { unset -nocomplain original expect encoded exList enList } -body { set original "ab\x02\x01\x01\x02cd" ; append original cd set encoded [::lz77::encode $original] set decoded [::lz77::decode $encoded] if { ![string equal $original $decoded] } { binary scan $original c* orList binary scan $decoded c* deList puts "$orList ->\n$deList" return "The strings were not equal" } } -result {} # When the input text is > 255 chars long, since that's # the max we can look back, or have in a run # Check that the string we're looking for is no longer than 255 # Check that we don't look back more than 255 # If there's multiple matches, get the longest one possible # "These blah is blah blah blah!" # ^ the match for here # ^ should start here # ^ not here test longest-1.1 { Get the longest match possible } -setup { unset -nocomplain original expect encoded exList enList set original {These blah is blah blah blah!} set expect "These blah is\x01\x06\x08" append expect "\x01\x09\x05!" } -body { set encoded [::lz77::encode $original] if { ![string equal $expect $encoded] } { binary scan $expect c* exList binary scan $encoded c* enList puts "Expect: $exList ->\nEncode: $enList" puts "DECODE: [::lz77::decode $encoded]" return "The strings were not equal" } set decoded [::lz77::decode $encoded] if { ![string equal $decoded $original] } { return "Decoded != Original" } } -result {} # ======================================== test maxSubstring-1.1 { Find the max substring for a string with only one match } -setup { unset -nocomplain length offset set string {abcdefcdefg} set index 6 } -body { set flag [::lz77::maxSubstring $index $string length offset] list $flag $length $offset } -result {1 4 4} test maxSubstring-1.2 { Make sure we can get matches of the min length } -setup { unset -nocomplain length offset set string {This is a} set index 5 } -body { set flag [::lz77::maxSubstring $index $string length offset] list $flag $length $offset } -result {1 3 3} test maxSubstring-1.3 { Return 0 if no match } -setup { unset -nocomplain length offset set string {abcdefghijk} set index 6 } -body { ::lz77::maxSubstring $index $string length offset } -result {0} test maxSubstring-1.4 { Find the max substring for a string with multiple matches } -setup { unset -nocomplain length offset set string {aaaabbbbaaaaaaaa} set index 9 } -body { set flag [::lz77::maxSubstring $index $string length offset] list $flag $length $offset } -result {1 7 1} test maxSubstring-1.5 { For a very long string (>32k), remove everything 32k \ past the index } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" } append original "abcdef" set index 17 } -body { set flag [::lz77::maxSubstring $index $original length offset] list $flag $length $offset } -result {1 32767 10} test maxSubstring-1.6 { For a very long string (>32k), remove anything more than 32k\ before the index } -constraints { slow } -setup { unset -nocomplain length offset set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" append original "0123456789012345678901234567890123456789" } append original "abcdef" set index 40000 } -body { set flag [::lz77::maxSubstring $index $original length offset] list $flag $length $offset } -result {1 32767 10} test maxSubstring-1.7 { } -setup { set data "abcdefghij012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123abcdefg" set index 274 } -body { set flag [::lz77::maxSubstring $index $data length offset] list $flag $length $offset } -result {1 7 274} test maxSubstring-1.8 { cycle a string through encode and decode } -setup { set original { scan [string index $data [incr i]] %c offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - else - binary scan [string range $data [incr i] [incr i]] S length binary scan [string range $data [incr i] [incr i]] S offset set index [expr {[string length $output] - $offset}] for {set j 0} {$j < $length} {incr j} - append output [string index $output $index] incr index - - } set index 466 } -body { set flag [::lz77::maxSubstring $index $original length offset] list $flag $length $offset } -result {1 258 411} # ======================================== cleanupTests return # ======================================== # Some timings puts SHORT:[time {::lz77::maxSubstring 9 aaaabbbbaaaaaaaa v1 v2} 10000] set original "abcdef" for {set i 0} {$i < 1024} {incr i} { append original "01234567890123456789012345678901" } append original "abcdef" puts LONG:[time {::lz77::maxSubstring 3 $original v1 v2} 1]
PS 10Sep04: This brings us terribly close to pure tcl deflate/gzip. Reading http://www.gzip.org/deflate.html suggests that the Huffman coding is not terribly hard to do.DKF: I'd estimate that the easiest way to speed this up would be to make [maxSubstring] return a list of its three result values instead of using upvar. Splitting the result list with foreach or lassign is much faster...MS The usage of [K] can also be sped up noticeably: use the idiom [K $x [set x {}]] instead of [K $x [unset x]]. Even faster is the bytecompiled variant [lindex [list $x [set x{}]] 0]. Which reminds me ... soon (?) to be the fastest is the plain $x[set x {}] - as soon as I get around to implementing FRE 924097 (optimise {}-appends [1]) This optimisation has been committed to HEAD on 2004-09-10 (8.5a2)Peter Spjuth: Is it intentional that maxSubstring prefers the earliest match? I would think that minimizing offset by prefering the latest match would be better. That would make a more efficient solution possible using [string last]. Also, is K really useful with [string range]?RHS To add some comments...
- No, K is not amazingly useful with string range. In most instances, it's slower to use K than to not. I'll retime things using [K $data [set data {}]] to see if that helps (and the bytecompiled version of lindex)
- It's not intentional that it prefers the earlier match. In fact, the later the match the better. However, I was having issues with [string last] and the fact that it had to search through data that wasn't an acceptable match (it started past the index). My algorithm has since changed, so I'll probably be changing it to use [string last]
- All those micro-optimizations, and I forgot the first rule... first, make your algorithm faster. I've since recoded maxSubstring to do a binary search instead of a linear one, and it's sped things up "considerably". My current implementation is broken (it passes the maxSubstring tests, but not the other ones, so maxSubstring tests are missing something), but I'll post the new code once it works.