Frequency calculation

Richard Suchenwirth 2002-08-01 - In Solving cryptograms, amazing ways of automated decryption were shown, and CL asked for accessories e.g. for computing character and digram (2-character sequence) frequencies. I could not withstand such a little challenge, so here goes. Following KBK, a dash (-) was taken to represent all non-letter characters. In the input the user has to take care of this himself, but the first and last character of the input text are also counted with added leading resp. trailing dash.

 proc freq12 string {
    # returns a pairlist: character/bigrams and the associated frequency
        set last -
        set n 0
        foreach char [split $string- ""] {
                incr n
                inc a($char)
                inc a($last$char)
                set last $char
        }
        set res {}
        foreach i [lsort [array names a]] {
                lappend res $i [expr {$a($i)*1./$n}]
        }
        set res
 }
 proc inc {varName {amount 1}} {
    # create a variable if not exists, then increment
    upvar 1 $varName var
    if {![info exists var]} {set var 0}
    incr var $amount
 }

 % freq12 "TCL-IS-A-SCRIPTING-LANGUAGE"
 - 0.178571428571 -A 0.0357142857143 -I 0.0357142857143 -L 0.0357142857143
 -S 0.0357142857143 -T 0.0357142857143 A 0.107142857143 A- 0.0357142857143
 AG 0.0357142857143 AN 0.0357142857143 C 0.0714285714286 CL 0.0357142857143
 CR 0.0357142857143 E 0.0357142857143 E- 0.0357142857143 G 0.107142857143
 G- 0.0357142857143 GE 0.0357142857143 GU 0.0357142857143 I 0.107142857143
 IN 0.0357142857143 IP 0.0357142857143 IS 0.0357142857143 L 0.0714285714286
 L- 0.0357142857143 LA 0.0357142857143 N 0.0714285714286 NG 0.0714285714286
 P 0.0357142857143 PT 0.0357142857143 R 0.0357142857143 RI 0.0357142857143
 S 0.0714285714286 S- 0.0357142857143 SC 0.0357142857143 T 0.0714285714286
 TC 0.0357142857143 TI 0.0357142857143 U 0.0357142857143 UA 0.0357142857143

[We could have a category just for "character-based management of textual corpora", or, perhaps more conventionally, "statistical text classification" or "n-grams".]

Oft-cited Damashek paper from '95: [1]. Also [2], [3], and [4].

DKF - Here's a version extended to copy with longer sequences too. It is a bit slower for the digram case, but much more flexible (not just in what it can produce, but also in what it can take as input.)

 proc freq {string {maxn 2}} {
    set len [string length $string]
    set s [string repeat - $maxn]
    append s [regsub -all {\W} [string toupper $string] -]$s
    for {set i $maxn} {$i<[string length $s]-1} {incr i} {
       for {set j [expr {$i-$maxn+1}]} {$j<=$i} {incr j} {
          set key [string range $s $j $i]
          if {[catch {incr a($key)}]} {
             set a($key) 1
          }
       }
    }
    foreach i [lsort [array names a]] {
       lappend res $i [expr {$a($i)*1./$len}]
    }
    set res
 }

It doesn't seem to give quite the same answers; not sure which code has the one-off error. (Both versions should be fine for substitution code decryption uses...)

The following version counts n-grams of specified length, maybe limited to those with a given prefix, and returns the list sorted by decreasing number:

 proc ngrams {text {length 1} {prefix ""}} {
    foreach item [regexp -all -inline $prefix[string repeat . $length] $text] {
       append a($item) .
    }
    set res {}
    foreach item [array names a] {
       lappend res [list $item [string length $a($item)]]
    }
    lsort -integer -decreasing -index 1 $res
 } ;#RS

# Testing:

 % ngrams "This is a sample text with a number of words and hopefully some repetititons"
 {{ } 13} {e 7} {s 6} {t 6} {i 5} {o 5} {a 4} {p 3} {r 3} {h 3} {l 3} {m 3} {n 3} {d 2} {u 2} {f 2} {w 2} {b 1} {T 1} {x 1} {y 1}
 % ngrams "This is a sample text with a number of words and hopefully some repetititons" 2
 {{ a} 2} {ti 2} {ns 1} {le 1} {nd 1} {{r } 1} {to 1} {{ s} 1} {ef 1} {om 1} {is 1} {{ t} 1} {{t } 1} {pe 1} {ly 1} {mp 1} {ex 1} {re 1} {op 1} {ds 1} {{ w} 1} {be 1} {{ h} 1} {wi 1} {ul 1} {or 1} {{ i} 1} {{a } 1} {um 1} {{s } 1} {th 1} {Th 1} {sa 1} {{e } 1} {of 1} {{ n} 1}
 % ngrams "This is a sample text with a number of words and hopefully some repetititons" 2 e
 {efu 1} {ext 1} {{e t} 1} {epe 1} {{e r} 1} {{er } 1}

tally: a string counter gadget

Arts and crafts of Tcl-Tk programming

Category String Processing