Updated 2012-09-12 23:19:03 by RLE

SEH tdelta.tcl:

Produce an rdiff-style delta signature of one file with respect to another, and re-create one file by applying the delta to the other.

A weak "rolling" checksum is used in the manner of rsync; that is, a digest of weak and strong checksums of file blocks is created from a target file, then the entire reference file is weak checksummed on a rolling basis and compared with the digest for matching blocks. The delta signature is a combination of references to matching blocks and non-matching content.
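
As a rough illustration of the rolling idea (not the code used by tdelta.tcl itself), the sketch below computes an rsync-style weak checksum directly and then updates it when the window slides by one byte. The names `weakSum` and `rollSum` are hypothetical, the modulus 65536 matches the `Mod` value in the listing further down, and integer arithmetic is assumed for brevity (Tcl 8.5 or later for `lassign`).

```tcl
# Illustrative sketch only; weakSum and rollSum are hypothetical helper names,
# not part of tdelta.tcl.
set M 65536

# Direct computation over a list of byte values.
# Returns {s a b}: the combined checksum and its two 16-bit halves.
proc weakSum {bytes} {
    global M
    set a 0
    set b 0
    set n [llength $bytes]
    set i 0
    foreach x $bytes {
        incr a $x
        incr b [expr {($n - $i) * $x}]
        incr i
    }
    set a [expr {$a % $M}]
    set b [expr {$b % $M}]
    return [list [expr {$a + $M * $b}] $a $b]
}

# Slide the window one byte: drop $old from the front, append $new at the end.
# $n is the window length. Tcl's integer % is never negative for a positive
# modulus, so no extra normalisation is needed here.
proc rollSum {a b n old new} {
    global M
    set a [expr {($a - $old + $new) % $M}]
    set b [expr {($b - $n * $old + $a) % $M}]
    return [list [expr {$a + $M * $b}] $a $b]
}

set data {10 20 30 40 50 60}
lassign [weakSum [lrange $data 0 3]] s a b   ;# window 10 20 30 40
set rolled [rollSum $a $b 4 10 50]           ;# window 20 30 40 50
set direct [weakSum [lrange $data 1 4]]
puts [expr {$rolled eq $direct ? "match" : "mismatch"}]
```

The rolled value agrees with a fresh computation over the shifted window, which is what lets the reference file be scanned one byte at a time without re-summing each block.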

File attribute information is also stored so that not only contents but permissions, times, etc. are faithfully reconstituted.

Strong md5 checksums of file segments are stored in the delta signature, which reduces efficiency but enhances security, since each block is verified against its checksum before it is copied. Truncated checksums can optionally be stored (see md5Size) to make the deltas smaller.
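
A minimal sketch of how a truncated checksum can still be verified: only as many hex digits as were stored are compared against the full checksum of the candidate block. The helper name `checksumPrefixMatches` is hypothetical, and the digest below (the md5 of the empty string) is just sample data, so no md5 package is needed to run it.

```tcl
# Hypothetical helper, not part of tdelta.tcl: compare a stored (possibly
# truncated) hex checksum against the full checksum of a candidate block.
proc checksumPrefixMatches {stored full} {
    set prefix [string range $full 0 [expr {[string length $stored] - 1}]]
    return [string equal -nocase $stored $prefix]
}

# Full md5 of the empty string, used here purely as sample data.
set full d41d8cd98f00b204e9800998ecf8427e

puts [checksumPrefixMatches D41D8CD9 $full]   ;# 8 hex digits stored, case differs
puts [checksumPrefixMatches d41d8cd8 $full]   ;# last stored digit differs
```

The first call prints 1 and the second prints 0. Storing 8 hex digits keeps 32 bits of the checksum per block, so shorter prefixes trade collision resistance for delta size.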

Update 10/02/04 -- Less brain-dead serialization of matching-block data produces much smaller deltas. Also, open channels can be passed to tdelta and tpatch instead of file names, so you can avoid temp files if you prefer.

4/11/05 -- tweaks for speed, tidiness and smaller deltas.
 # tdelta.tcl --
 #
 #        Produce an rdiff-style delta signature of one file with respect to another,
 #        and re-create one file by applying the delta to the other.
 #
 # Written by Stephen Huntley (stephen.huntley@alum.mit.edu)
 #
 # Usage:
 #
 # tdelta <reference file | channel> <target file | channel> <blockSize> [sizecheck [fingerprint]]
 #        Returns a delta of the target file with respect to the reference file,
 #        i.e., using tpatch to apply the delta to the target file will re-create the reference file.
 #
 #        sizecheck and fingerprint are booleans which enable time-saving checks: 
 #
 #        if sizecheck is True and the target file size is
 #        less than five times the block size, then no delta calculation is done and the
 #        signature contains the full reference file contents.
 #
 #        if fingerprint is True then up to 10 small strings ("fingerprints", 20 bytes each) are
 #        taken from the target file and searched for in the reference file.  If fewer than three
 #        are found, then no delta calculation is done and the signature contains the full
 #        reference file contents.
 #
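 # tpatch <target file | channel> <delta signature> <output file (duplicate of reference file) | channel> [eolNative]
 #        Reconstitute original reference file by applying delta to target file.
 #        If eolNative is True and the reconstructed contents are pure ASCII, the output
 #        file is rewritten with the platform's native line endings.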
 #
 #
 # namespace variables (in ::trsync):
 #
 # blockSize
 #        Default size of file segments to compare (tdelta itself takes
 #        the block size as its third argument).
 #         Smaller blockSize tends to create smaller delta.
 #         Larger blockSize tends to take more time to compute delta.
 # md5Size
 #        Substring of md5 checksum to store in delta signature.
 #        If security is less of a concern, set md5Size to a number
 #        between 1-32 to create a more compact signature.

 namespace eval ::trsync {

 if ![info exists blockSize] {variable blockSize 100}
 if ![info exists Mod] {variable Mod [expr pow(2,16)]}
 if ![info exists md5Size] {variable md5Size 32}

 variable temp
 if ![info exists temp] {
        catch {set temp $::env(TMP)}
        catch {set temp $::env(TEMP)}
        catch {set temp $::env(TRSYNC_TEMP)}
        if [catch {file mkdir $temp}] {set temp [pwd]}
 }
 if ![file writable $temp] {error "temp location not writable"}

 proc Backup {args} {
        return
 }

 proc ConstructFile {copyinstructions {eolNative 0} {backup {}}} {
        if [catch {package present md5 2}] {package forget md5 ; package require md5 2}

        set fileToConstruct [lindex $copyinstructions 0]
        set existingFile [lindex $copyinstructions 1]
        set blockSize [lindex $copyinstructions 2]
        array set fileStats [lindex $copyinstructions 3]
        array set digestInstructionArray [DigestInstructionsExpand [lindex $copyinstructions 4] $blockSize]
        array set dataInstructionArray [lindex $copyinstructions 5]
        unset copyinstructions

        if {[lsearch [file channels] $existingFile] == -1} {
                set existingFile [FileNameNormalize $existingFile]
                if {$fileToConstruct == {}} {file delete -force $existingFile ; return}
                catch {
                        set existingID [open $existingFile r]
                        fconfigure $existingID -translation binary
                }
        } else {
                set existingID $existingFile
                fconfigure $existingID -translation binary
        }

        set temp $::trsync::temp

        if {[lsearch [file channels] $fileToConstruct] == -1} {
                set fileToConstruct [FileNameNormalize $fileToConstruct]
                set constructTag "trsync.[md5::md5 -hex "[clock seconds] [clock clicks]"]"
                set constructID [open $temp/$constructTag w]
        } else {
                set constructID $fileToConstruct
        }
        fconfigure $constructID -translation binary

        if $eolNative {set eolNative [string is ascii -strict [array get dataInstructionArray]]}

        set filePointer 1
        while {$filePointer <= $fileStats(size)} {
                if {[array names dataInstructionArray $filePointer] != {}} {
                        puts -nonewline $constructID $dataInstructionArray($filePointer)
                        set segmentLength [string length $dataInstructionArray($filePointer)]
                        array unset dataInstructionArray $filePointer
                        set filePointer [expr $filePointer + $segmentLength]
                } elseif {[array names digestInstructionArray $filePointer] != {}} {
                        if ![info exists existingID] {error "Corrupt copy instructions."}
                        set blockNumber [lindex $digestInstructionArray($filePointer) 0]
                        set blockMd5Sum [lindex $digestInstructionArray($filePointer) 1]

                        seek $existingID [expr $blockNumber * $blockSize]

                        set existingBlock [read $existingID $blockSize]
                        set existingBlockMd5Sum [string range [md5::md5 -hex -- $existingBlock] 0 [expr [string length $blockMd5Sum] - 1]]
                        if ![string equal -nocase $blockMd5Sum $existingBlockMd5Sum] {error "digest file contents mismatch"}
                        puts -nonewline $constructID $existingBlock

                        if $eolNative {set eolNative [string is ascii -strict $existingBlock]}
                        unset existingBlock
                        set filePointer [expr $filePointer + $blockSize]
                } else {
                        error "Corrupt copy instructions."
                }
        }
        catch {close $existingID}
        set fileStats(eolNative) $eolNative
        if {[lsearch [file channels] $fileToConstruct] > -1} {return [array get fileStats]}

        close $constructID

        if $eolNative {
                fcopy [set fin [open $temp/$constructTag r]] [set fout [open $temp/${constructTag}fcopy w]]
                close $fin
                close $fout
                file delete -force $temp/$constructTag
                set constructTag "${constructTag}fcopy"
        }

        catch {file attributes $temp/$constructTag -readonly 0} result
        catch {file attributes $temp/$constructTag -permissions rw-rw-rw-} result
        catch {file attributes $temp/$constructTag -owner $fileStats(uid)} result
        catch {file attributes $temp/$constructTag -group $fileStats(gid)} result
        catch {file mtime $temp/$constructTag $fileStats(mtime)} result
        catch {file atime $temp/$constructTag $fileStats(atime)} result
        if [string equal $fileToConstruct $existingFile] {
                catch {file attributes $existingFile -readonly 0} result
                catch {file attributes $existingFile -permissions rw-rw-rw-} result
        }

        Backup $backup $fileToConstruct

        file mkdir [file dirname $fileToConstruct]
        file rename -force $temp/$constructTag $fileToConstruct
        array set attributes $fileStats(attributes)
        array set attrConstruct [file attributes $fileToConstruct]
        foreach attr [array names attributes] {
                if [string equal [array get attributes $attr] [array get attrConstruct $attr]] {continue}
                if {[string equal $attr "-longname"] || [string equal $attr "-shortname"] || [string equal $attr "-permissions"]} {continue}
                catch {file attributes $fileToConstruct $attr $attributes($attr)} result
        }
        catch {file attributes $fileToConstruct -permissions $fileStats(mode)} result
        return
 }

 proc CopyInstructions {filename digest} {
        if [catch {package present md5 2}] {package forget md5 ; package require md5 2}

        if {[lsearch [file channels] $filename] == -1} {
                set filename [FileNameNormalize $filename]
                file stat $filename fileStats
                array set fileAttributes [file attributes $filename]
                array unset fileAttributes -longname
                array unset fileAttributes -shortname
                set arrayadd attributes ; lappend arrayadd [array get fileAttributes] ; array set fileStats $arrayadd
                set f [open $filename r]
        } else {
                set f $filename
                set fileStats(attributes) {}
        }
        fconfigure $f -translation binary
        seek $f 0 end
        set fileSize [tell $f]
        seek $f 0
        set fileStats(size) $fileSize
        set digestFileName [lindex $digest 0]
        set blockSize [lindex $digest 1]
        set digest [lrange $digest 2 end]

        if {[lsearch -exact $digest fingerprints] > -1} {
                set fingerPrints [lindex $digest end]
                set digest [lrange $digest 0 end-2]
                set fileContents [read $f]
                set matchCount 0
                foreach fP $fingerPrints {
                        if {[string first $fP $fileContents] > -1} {incr matchCount}
                         if {$matchCount >= 3} {break}
                }
                unset fileContents
                seek $f 0
                if {$matchCount < 3} {set digest {}}
        }

        set digestLength [llength $digest]
        for {set i 0} {$i < $digestLength} {incr i} {
                set arrayadd [lindex [lindex $digest $i] 1]
                lappend arrayadd $i
                array set Checksums $arrayadd
        }
        set digestInstructions {}
        set dataInstructions {}
        set weakChecksum {}
        set startBlockPointer 0
        set endBlockPointer 0

        if ![array exists Checksums] {
                set dataInstructions 1
                lappend dataInstructions [read $f]
                set endBlockPointer $fileSize
        }

        while {$endBlockPointer < $fileSize} {
                set endBlockPointer [expr $startBlockPointer + $blockSize]
                incr startBlockPointer
                if {$weakChecksum == {}} {
                        set blockContents [read $f $blockSize]
                        set blockNumberSequence [SequenceBlock $blockContents]
                        set weakChecksumInfo [WeakChecksum $blockNumberSequence]
                        set weakChecksum [format %.0f [lindex $weakChecksumInfo 0]]
                        set startDataPointer $startBlockPointer
                        set endDataPointer $startDataPointer
                        set dataBuffer {}
                }
                if {[array names Checksums $weakChecksum] != {}} {
                        set md5Sum [md5::md5 -hex -- $blockContents]
                        set blockIndex $Checksums($weakChecksum)
                        set digestmd5Sum [lindex [lindex $digest $blockIndex] 0]
                        if [string equal -nocase $digestmd5Sum $md5Sum] {
                                if {$endDataPointer > $startDataPointer} {
                                        lappend dataInstructions $startDataPointer
                                        lappend dataInstructions $dataBuffer
                                }
                                lappend digestInstructions $startBlockPointer
                                lappend digestInstructions "$blockIndex [string range $md5Sum 0 [expr $::trsync::md5Size - 1]]"
                                set weakChecksum {}
                                set startBlockPointer $endBlockPointer
                                continue
                        }
                }
                if {$endBlockPointer >= $fileSize} {
                        lappend dataInstructions $startDataPointer
                        lappend dataInstructions $dataBuffer$blockContents
                        break
                }
                set rollChar [read $f 1]
                binary scan $rollChar c* rollNumber
                set rollNumber [expr ($rollNumber + 0x100)%0x100]
                lappend blockNumberSequence $rollNumber
                set blockNumberSequence [lrange $blockNumberSequence 1 end]

                binary scan $blockContents a1a* rollOffChar blockContents
                set blockContents $blockContents$rollChar
                set dataBuffer $dataBuffer$rollOffChar
                incr endDataPointer

                set weakChecksumInfo "[eval RollChecksum [lrange $weakChecksumInfo 1 5] $rollNumber] [lindex $blockNumberSequence 0]"
                set weakChecksum [format %.0f [lindex $weakChecksumInfo 0]]
        }
        close $f

        lappend copyInstructions $filename
        lappend copyInstructions $digestFileName
        lappend copyInstructions $blockSize
        lappend copyInstructions [array get fileStats]
        lappend copyInstructions [DigestInstructionsCompress $digestInstructions $blockSize]
        lappend copyInstructions $dataInstructions
        return $copyInstructions
 }

 proc Digest {filename blockSize {sizecheck 0} {fingerprint 0}} {
        if [catch {package present md5 2}] {package forget md5 ; package require md5 2}

        set digest "[list $filename] $blockSize"
        if {[lsearch [file channels] $filename] == -1} {
                set filename [FileNameNormalize $filename]
                set digest "[list $filename] $blockSize"
                if {!([file isfile $filename] && [file readable $filename])} {return $digest}
                set f [open $filename r]
        } else {
                set f $filename
        }
        fconfigure $f -translation binary
        seek $f 0 end
        set fileSize [tell $f]
        seek $f 0
        if {$sizecheck && ($fileSize < [expr $blockSize * 5])} {close $f ; return $digest}

        while {![eof $f]} {
                set blockContents [read $f $blockSize]
                set md5Sum [md5::md5 -hex -- $blockContents]
                set blockNumberSequence [SequenceBlock $blockContents]
                set weakChecksum [lindex [WeakChecksum $blockNumberSequence] 0]
                lappend digest "$md5Sum [format %.0f $weakChecksum]"
        }
        if $fingerprint {
                set fileIncrement [expr $fileSize/10]
                set fpLocation [expr $fileSize - 21]
                set i 0
                while {$i < 10} {
                        if {$fpLocation < 0} {set fpLocation 0}
                        seek $f $fpLocation
                        lappend fingerPrints [read $f 20]
                        set fpLocation [expr $fpLocation - $fileIncrement]
                        incr i
                }
                lappend digest fingerprints
                lappend digest [lsort -unique $fingerPrints]
        }
        close $f
        return $digest
 }

 proc DigestInstructionsCompress {digestInstructions blockSize} {
        if [string equal $digestInstructions {}] {return {}}
        set blockSpan $blockSize
        foreach {pointer blockInfo} $digestInstructions {
                if ![info exists currentBlockInfo] {
                        set currentPointer $pointer
                        set currentBlockInfo $blockInfo
                        set md5Size [string length [lindex $blockInfo 1]]
                        continue
                }
                if {$pointer == [expr $currentPointer + $blockSpan]} {
                        set md5 [lindex $blockInfo 1]
                        lappend currentBlockInfo $md5
                        incr blockSpan $blockSize
                } else {
                        lappend newDigestInstructions $currentPointer
                        lappend newDigestInstructions "[lindex $currentBlockInfo 0] [list "$md5Size [string map {{ } {}} [lrange $currentBlockInfo 1 end]]"]"

                        set currentPointer $pointer
                        set currentBlockInfo $blockInfo
                        set blockSpan $blockSize
                }
        }
        lappend newDigestInstructions $currentPointer
        lappend newDigestInstructions "[lindex $currentBlockInfo 0] [list "$md5Size [string map {{ } {}} [lrange $currentBlockInfo 1 end]]"]"
        return $newDigestInstructions
 }

 proc DigestInstructionsExpand {digestInstructions blockSize} {
        if [string equal $digestInstructions {}] {return {}}
        foreach {pointer blockInfo} $digestInstructions {
                set blockNumber [lindex $blockInfo 0]
                set md5Size [lindex [lindex $blockInfo 1] 0]
                set blockString [lindex [lindex $blockInfo 1] 1]
                set blockLength [string length $blockString]

                set expandedBlock {}
                for {set i $md5Size} {$i <= $blockLength} {incr i $md5Size} {
                        append expandedBlock " [string range $blockString [expr $i - $md5Size] [expr $i - 1]]"
                }

                set blockInfo "$blockNumber $expandedBlock"
                foreach md5 [lrange $blockInfo 1 end] {
                        lappend newDigestInstructions $pointer
                        lappend newDigestInstructions "$blockNumber $md5"
                        incr pointer $blockSize
                        incr blockNumber
                }
        }
        return $newDigestInstructions
 }

 proc FileNameNormalize {filename} {
        file normalize $filename
 }

 proc RollChecksum {a(k,l)_ b(k,l)_ k l Xsub_k Xsub_l+1 } {
        set Mod $trsync::Mod

        set a(k+1,l+1)_ [expr ${a(k,l)_} - $Xsub_k + ${Xsub_l+1}]
        set b(k+1,l+1)_ [expr ${b(k,l)_} - (($l - $k + 1) * $Xsub_k) + ${a(k+1,l+1)_}]

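         # fmod keeps the sign of its dividend; add Mod so the residues stay non-negative
         set a(k+1,l+1)_ [expr fmod(fmod(${a(k+1,l+1)_},$Mod)+$Mod,$Mod)]
         set b(k+1,l+1)_ [expr fmod(fmod(${b(k+1,l+1)_},$Mod)+$Mod,$Mod)]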
        set s(k+1,l+1)_ [expr ${a(k+1,l+1)_} + ($Mod * ${b(k+1,l+1)_})]
        return "${s(k+1,l+1)_} ${a(k+1,l+1)_} ${b(k+1,l+1)_} [incr k] [incr l]"
 }

 proc SequenceBlock {blockcontents} {
        binary scan $blockcontents c* blockNumberSequence
        set blockNumberSequenceLength [llength $blockNumberSequence]
        for {set i 0} {$i < $blockNumberSequenceLength} {incr i} {
                set blockNumberSequence [lreplace $blockNumberSequence $i $i [expr ([lindex $blockNumberSequence $i] + 0x100)%0x100]]
        }
        return $blockNumberSequence
 }

 proc WeakChecksum {Xsub_k...Xsub_l} {
        set a(k,i)_ 0
        set b(k,i)_ 0
        set Mod $trsync::Mod
        set k 1
        set l [llength ${Xsub_k...Xsub_l}]
        for {set i $k} {$i <= $l} {incr i} {
                set Xsub_i [lindex ${Xsub_k...Xsub_l} [expr $i - 1]]
                set a(k,i)_ [expr ${a(k,i)_} + $Xsub_i]
                set b(k,i)_ [expr ${b(k,i)_} + (($l - $i + 1) * $Xsub_i)]
        }
        set a(k,l)_ [expr fmod(${a(k,i)_},$Mod)]
        set b(k,l)_ [expr fmod(${b(k,i)_},$Mod)]
        set s(k,l)_ [expr ${a(k,l)_} + ($Mod * ${b(k,l)_})]
        return "${s(k,l)_} ${a(k,l)_} ${b(k,l)_} $k $l [lindex ${Xsub_k...Xsub_l} 0]"
 }

 proc tdelta {referenceFile targetFile blockSize {sizecheck 0} {fingerprint 0}} {
        if {$::trsync::md5Size < 1} {error "md5Size must be greater than zero."}
        set signature [Digest $targetFile $blockSize $sizecheck $fingerprint]
        return [CopyInstructions $referenceFile $signature]
 }

 proc tpatch {targetFile copyInstructions fileToConstruct {eolNative 0}} {
        set copyInstructions [lreplace $copyInstructions 0 1 $fileToConstruct $targetFile]
        return [ConstructFile $copyInstructions $eolNative]
 }

 namespace export tdelta tpatch

 }
 # end namespace eval ::trsync
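
The fingerprint pre-check described in the header comments can be sketched on plain strings as follows. This is a simplified illustration under stated assumptions, not the listing's own logic: the helper names `takeFingerprints` and `looksRelated` are hypothetical, samples are taken front to back rather than from the end of the file, and Tcl 8.5 or later is assumed for `max()`.

```tcl
# Hypothetical helpers; names and defaults are illustrative, not from tdelta.tcl.

# Take up to $count samples of $len bytes, spaced evenly through $data.
proc takeFingerprints {data {count 10} {len 20}} {
    set size [string length $data]
    set step [expr {max(1, $size / $count)}]
    set prints {}
    for {set pos 0} {$pos < $size && [llength $prints] < $count} {incr pos $step} {
        lappend prints [string range $data $pos [expr {$pos + $len - 1}]]
    }
    return [lsort -unique $prints]
}

# Return 1 if at least $needed fingerprints occur somewhere in $other.
proc looksRelated {prints other {needed 3}} {
    set hits 0
    foreach fp $prints {
        if {[string first $fp $other] >= 0} {incr hits}
        if {$hits >= $needed} {return 1}
    }
    return 0
}

# Sample data: a "target" with varied content, a near-copy, and unrelated text.
set target ""
for {set i 0} {$i < 60} {incr i} {
    append target "record $i: value [expr {$i * $i}]\n"
}
set similar   "HEADER\n$target"
set unrelated [string repeat "0123456789" 90]

puts [looksRelated [takeFingerprints $target] $similar]
puts [looksRelated [takeFingerprints $target] $unrelated]
```

The first call prints 1 and the second prints 0. When the check fails, computing a delta is unlikely to pay off, which is why the full reference contents are stored instead.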

This is very cool ... how does it work? :) BTW, did you see in tcllib there's a longest-sublist-match thing which is talked about in this connection, and might be useful.

Are you thinking what I'm thinking? First a delta versioning FS, then a networked delta FS with optimistic checkin (CVS' big innovation, IMHO) leading to a really good pure-tcl s/w distribution mechanism? Or even a distributed VFS for text? - CMcC 20041029

SEH -- There are many possibilities, I wish I could code them faster. Right now I'm going to use this code to create a complementary virtual filesystem on which to stack a versioning virtual filesystem, which can serve as the backbone of a personal backup-archive utility. From there a distributed development-archive service shouldn't be a big leap.

jcw - By hacking things and adding the following at the end, I was able to try it out as command-line utility:
  source kitten.kit

  switch [llength $argv] {
    2 {
      set cmd ::trsync::tdelta
      lappend argv $::trsync::blockSize
    }
    3 {
      set cmd ::trsync::tpatch
      set argv [lreplace $argv 1 1 [read [open [lindex $argv 1]]]]
    }
    default {
      puts stderr "
        Usage: $argv0 rfile tfile >delta
           or: $argv0 tfile delta ofile
      "
      exit 1
    }
  }

  puts -nonewline [eval [linsert $argv 0 $cmd]]

Could this be a useful mechanism to bring wikit difference history into the wikit.tkd database file? It sure looks intriguing...