proc freq12 string { # returns a pairlist: character/bigrams and the associated frequency set last - set n 0 foreach char [split $string- ""] { incr n inc a($char) inc a($last$char) set last $char } set res {} foreach i [lsort [array names a]] { lappend res $i [expr {$a($i)*1./$n}] } set res } proc inc {varName {amount 1}} { # create a variable if not exists, then increment upvar 1 $varName var if {![info exists var]} {set var 0} incr var $amount }
% freq12 "TCL-IS-A-SCRIPTING-LANGUAGE" - 0.178571428571 -A 0.0357142857143 -I 0.0357142857143 -L 0.0357142857143 -S 0.0357142857143 -T 0.0357142857143 A 0.107142857143 A- 0.0357142857143 AG 0.0357142857143 AN 0.0357142857143 C 0.0714285714286 CL 0.0357142857143 CR 0.0357142857143 E 0.0357142857143 E- 0.0357142857143 G 0.107142857143 G- 0.0357142857143 GE 0.0357142857143 GU 0.0357142857143 I 0.107142857143 IN 0.0357142857143 IP 0.0357142857143 IS 0.0357142857143 L 0.0714285714286 L- 0.0357142857143 LA 0.0357142857143 N 0.0714285714286 NG 0.0714285714286 P 0.0357142857143 PT 0.0357142857143 R 0.0357142857143 RI 0.0357142857143 S 0.0714285714286 S- 0.0357142857143 SC 0.0357142857143 T 0.0714285714286 TC 0.0357142857143 TI 0.0357142857143 U 0.0357142857143 UA 0.0357142857143
[We could have a category just for "character-based management of textual corpora", or, perhaps more conventionally, "statistical text classification" or "n-grams".]Oft-cited Damashek paper from '95: [1]. Also [2], [3], and [4].
DKF - Here's a version extended to copy with longer sequences too. It is a bit slower for the digram case, but much more flexible (not just in what it can produce, but also in what it can take as input.)
proc freq {string {maxn 2}} { set len [string length $string] set s [string repeat - $maxn] append s [regsub -all {\W} [string toupper $string] -]$s for {set i $maxn} {$i<[string length $s]-1} {incr i} { for {set j [expr {$i-$maxn+1}]} {$j<=$i} {incr j} { set key [string range $s $j $i] if {[catch {incr a($key)}]} { set a($key) 1 } } } foreach i [lsort [array names a]] { lappend res $i [expr {$a($i)*1./$len}] } set res }It doesn't seem to give quite the same answers; not sure which code has the one-off error. (Both versions should be fine for substitution code decryption uses...)
The following version counts n-grams of specified length, maybe limited to those with a given prefix, and returns the list sorted by decreasing number:
proc ngrams {text {length 1} {prefix ""}} { foreach item [regexp -all -inline $prefix[string repeat . $length] $text] { append a($item) . } set res {} foreach item [array names a] { lappend res [list $item [string length $a($item)]] } lsort -integer -decreasing -index 1 $res } ;#RS# Testing:
% ngrams "This is a sample text with a number of words and hopefully some repetititons" {{ } 13} {e 7} {s 6} {t 6} {i 5} {o 5} {a 4} {p 3} {r 3} {h 3} {l 3} {m 3} {n 3} {d 2} {u 2} {f 2} {w 2} {b 1} {T 1} {x 1} {y 1} % ngrams "This is a sample text with a number of words and hopefully some repetititons" 2 {{ a} 2} {ti 2} {ns 1} {le 1} {nd 1} {{r } 1} {to 1} {{ s} 1} {ef 1} {om 1} {is 1} {{ t} 1} {{t } 1} {pe 1} {ly 1} {mp 1} {ex 1} {re 1} {op 1} {ds 1} {{ w} 1} {be 1} {{ h} 1} {wi 1} {ul 1} {or 1} {{ i} 1} {{a } 1} {um 1} {{s } 1} {th 1} {Th 1} {sa 1} {{e } 1} {of 1} {{ n} 1} % ngrams "This is a sample text with a number of words and hopefully some repetititons" 2 e {efu 1} {ext 1} {{e t} 1} {epe 1} {{e r} 1} {{er } 1}