Updated 2013-06-28 09:07:04 by suchenwi

-- Another weekend fun project by Richard Suchenwirth:

Occurrence frequencies of strings are useful in a number of applications: text statistics, document indexing, language determination, ... In Tcl, this is easily done with an array whose keys contain the strings, and the values have the count. Following is a more elaborate version with a user interface somehow similar to a Tk widget, but since it's pure Tcl, I call such things "gadgets" ;-) (see also A matrix gadget). Optional parameters are passed in the -key value style. Keywords may be abbreviated as long as they're unique. This is easily achieved by requiring
llength [array names T $key*]==1

It would be >1 for ambiguous abbrevs, and 0 for undefined keys.

Tally features include counting words, characters, or sequences (N-grams) of these. Beware that the number of character ngrams grows very fast, execution gets slow! When counting words, a regular expression to match may be specified, and/or a list of words to omit. No preprocessing regarding case, punctuation..., is done internally yet, so should be done before.

USAGE
tally <T> -unit <word|char|ngram> -omit <list> -match <re>
<T> add <string>
<T> destroy
<T> flist -min <int> -mode <abs|both>
<T> names
<T> size

The proc tally is called both for creating a tally, and by redirection from the object commands, e.g. T add "hello world" is reformatted to tally - T add "hello world".

Implementation is by an array that holds both the counts and a number of parameters which start with an underscore -- so strings starting with _ should better not be tallied ;-( Hmm.. this needs some more mystification.
 proc tally {name args} {
     set create 1; if {$name=="-"} {
         set create 0; set name [lindex $args 0]
         set cmd [lindex $args 1]; set args [lrange $args 2 end]
     }
     upvar $name T
     if {$create} {
         if [info exists T] {error "tally: couldn't create $name: exists"}
         array set T {_Total 0 _title "" _unit word _omit "" _match .+}
         foreach {key value} $args {
             regsub "^-" $key "_" _key
             set akey [array names T $_key*]
             if {[llength $akey]!=1} {error "tally: bad option $key"}
             set T($akey) $value
         }
         proc $name {args} "upvar $name T; eval tally - T \$args"
         trace variable T u "rename $name {};#"
         # preceding comment is NOT optional - shields added args!
     } else {
         switch -- $cmd {
             add    {eval tally:add T $args}
             destroy {unset T; rename $name ""}
             flist  {eval tally:flist T $args}
             names  {tally:names T}
             size   {llength [tally - T names]} 
         }
     }
 }
# The real work is done in these farmed-out procs: 
 proc tally:add {name input} {
     upvar $name T
     set omit $T(_omit)
     set unit $T(_unit)
     set re ^$T(_match)$
     switch -- $unit {
         word   {set list [split $input]}
         wngram {set list [subseq [split $input]]}
         char   {set list [split $input ""]}
         ngram  {set list [subseq [split $input ""]]}
     }
     foreach i $list {
         if {$i=={}} continue
         incr T(_Total)
         if {$unit=="word" && [lsearch -exact $omit $i]>=0} continue
         if {$unit=="ngram"} {set i [join $i ""]}
         if ![regexp $re $i] continue
         if [info exists T($i)] {incr T($i)} else {set T($i) 1}
     }
 }
 proc tally:flist {name args} {
     upvar $name T
     array set opt {-min 1 -mode abs}
     array set opt $args
     set t100 [expr $T(_Total)/100.]
     set mode $opt(-mode)
     set t [list [list Total: $T(_Total)]]
     foreach {s n} [array get T] {
         if {![regexp ^_ $s] && $n>=$opt(-min)} {
             switch -- $mode {
                 abs {lappend t [list $s $n] ;# make list of pairs}
                 both {lappend t [list $s $n [expr $n/$t100]]}
             }
         }
     }
     lsort -integer -index 1 -decreasing $t
 }
 proc tally:names name {
     upvar $name T
     set res {}
     foreach i [array names T] {if ![regexp ^_ $i] {lappend res $i}}
     lsort $res
 }
#This produces all subsequences of a list:
# subseq {A B C} -> {A {A B} {A B C} B {B C} C}
 proc subseq {list} {
     set res {}
     for {set i 0} {$i<[llength $list]} {incr i} {
         for {set j $i} {$j<[llength $list]} {incr j} {
             lappend res [lrange $list $i $j]
         }
     }
     set res
 }
#Testing the whole thing: let tally tally its own source (this page).
 proc tally:test {} {uplevel #0 {
     catch {unset T2}
     tally T2 -unit word -match {[A-Za-z0-9]+} -omit {
         a also an and another are because both but by could couldn't 
         for have haven't if in is isn't it it's not of or since some somehow
         that the this to when where while with
     }
     set f [open tally.tcl r]
     foreach i [split [read $f] \n] {regsub -all _ $i "" i; T2 add $i}
     T2 flist -min 2 -mode abs
 }   }
 if 1 tally:test

Also see "frequency calculation" and subsequences. Further see Tcllib's counter module/package.