Fuzzy string search

Richard Suchenwirth 2002-08-03 - Comparing strings for equality is easily done in Tcl - but returns "unequal" even if strings are "almost equal", differ in very few positions, e.g. because of a typo. Measuring string similarity with the Levenshtein distance is precise, but slow, so applying that method to a long list of candidate words will take Tcl too long. Therefore, I here experiment with a faster string filtering method in order to exclude all but a few candidates. Fuzzy values are often floating-point between 1.0 (true) and 0 (false), but I scale the results to integers between 0 and 100 - not because integers would be faster, but because results look more compact this way.

The idea is that binary "features" of a word are extracted. Up to 32 of such fit well into one integer, so part of the comparison can be done with integer arithmetics in expr. Let's start with a "Tcl spell checker" that tries to match a string against the known commands, and returns a few most similar candidates. For simplicity, the first features will just be single lowercase letters (plus _), which gives us 27 features. The first routine returns the "bitkey" integer for a string, given a feature list:

 proc bitkey {string features} {
    set res 0
    set i 0
    foreach feature $features {
        if {[string first $feature $string]>=0} {
            set res [expr {$res | 1<<$i}]
        }
        incr i
    }
    set res
 }

We can use this to produce a lookup table of bitkeys for a given word list, implemented as an array. This takes care that different words may return the same bitkey, so new results are appended to possible previous results. The feature set is also kept in the array:

 proc makeLUT {arrName words features} {
    upvar 1 $arrName arr
    set arr() $features
    foreach word $words {
        lappend arr([bitkey $word $features]) $word
    }
 }
# A simple similarity measure, common bits/possible bits: 
 proc bitkeySimilarity {a b {scale 100}} {
    set and [bitcount [expr {$a & $b}]]
    set or  [bitcount [expr {$a | $b}]]
    expr {$and * $scale / $or}
 }
 proc bitcount int {
    # returns the number of 1-bits
    set res 0
    set bit 32
    while {$bit} {
        if {$int & 1<<[incr bit -1]} {incr res}
    }
    set res
 }
 proc bitcount0 int {
    # this alternative was tested, but took almost double the time
    binary scan [binary format I $int] b* bits
    regexp -all 1 $bits
 }

This is the fuzzy search "engine" which returns all candidates whose bit similarity is not below the given threshold:

 proc similars {arrName string threshold} {
    upvar 1 $arrName arr
    set bitkey [bitkey $string $arr()]
    set res {}
    foreach candidate [array names arr] {
        if {$candidate == ""} continue
        set sim [bitkeySimilarity $bitkey $candidate]
        if {$sim >= $threshold} {
            lappend res [list $arr($candidate) $sim]
        }
    }
    lsort -real -decreasing -index 1 $res
 }
#------------------------now testing: 
 proc test {word} {
    global lut
    set dictionary [concat [info commands] [info globals]]
    set features {a b c d e f g h i j k l m n o p q r s t u v w x y z _}
    if {![info exists lut]} {makeLUT lut $dictionary $features}
    similars lut $word 60
 }
 proc testn {{n 1}} {
    puts "[llength [array names ::lut]] candidates:"
    foreach word {
        forach swithc prok brek whyle retorn retourn enkoding
    } {
        set t [lindex [time {set res [test $word]} $n] 0]
        puts $word:$t->$res
    }
 }

Example results:

 311 % test enkoding
 {encoding 75} {tkEventMotifBindings 63} {tkListboxBeginExtend 63} {widget:info 60}
 304 % test forach
 {foreach 85}
 318 % test  whyle
 {while 66}
 302 % test  continu
 {bitcount 85} {continue 85} {ccount 83} {popup:init 71} {text:linecount 66} {counter 62} {tkMenuMotion 62}

In a standard tclsh with 109 words (info commands + info globals), search time per word was around 80 msec.

In another test, for which the data are too voluminous for the Wiki, 3626 Swiss place names were taken as dictionary. As about 10% of the entries collapsed with others, the lookup table had 3335 entries. On my P200 box, each query took about 2.2 seconds. At 3776 (Swiss names and [info commands] combined ;-), this grew to 2.5 sec. Not surprisingly, search time grows linearly with the size of the lookup table. One can control the search time by manipulating the features list: using only the ten {a e i o u n r s t h}, the lookup table had only 743 entries (of a possible 1024) and could be traversed in 0.5 seconds per try. The returned "equivalence classes" were considerably longer. You have to trade off quick filtering against few results..

In experimenting we occasionally see surprising results in first position - because the string length is not considered for similarity (it cannot, because the bitkey does not preserve that information).

 % test retorn
 {tkButtonEnter 71} {counter 71} {.top.entry 71} {return 66} {entry 66} {errorInfo 66} {sep:intro 62} {PaneGeometry 62} {tkMenuDownArrow 62} {error 60}

Here is an "after-burner" that weights results with length similarity:

 proc weightLength {word results} {
    set lword [string length $word]
    set res {}
    foreach result $results {
        set value [lindex $result 1]
        foreach candidate [lindex $result 0] {
            set lcand [string length $candidate]
            set minl [expr {$lword < $lcand? $lword: $lcand}]
            set maxl [expr {$lword > $lcand? $lword: $lcand}]
            set value2 [expr {$value*$minl/$maxl}]
            lappend res [list $candidate $value2]
        }
    }
    lsort -integer -decreasing -index 1 $res
 }

Its result makes more sense, at added time cost of 1.5 msec:

 % weightLength retorn [test retorn]
 {return 66} {counter 60} {entry 55} {error 50} {errorInfo 44} {.top.entry 42} {sep:intro 41} {tkButtonEnter 32} {PaneGeometry 31} {tkMenuDownArrow 24}

Another filter takes limited locality into account, by comparing the bigram (two-character sequences) sets of the word in question to the candidates, with gratuituous leading and trailing spaces added:

 proc bigrams string {
    set res {}
    set last " "
    foreach char [split "$string " ""] {
        lappend res $last$char
        set last $char
    }
    set res
 }
 % bigrams bigrams
 { b} bi ig gr ra am ms {s }

As we don't define a feature set here, mapping bigrams to bits is excluded. Instead, two logical operations on sets represented as lists are used:

 proc and {L1 L2} {
    set res {}
    foreach i $L1 {set a($i) 1}
    foreach i $L2 {set b($i) 1}
    foreach i [array names a] {
        if {[info exists b($i)]} {
            lappend res $i
        }
    }
    set res
 }
 proc or {L1 L2} {
    foreach i [concat $L1 $L2] {set a($i) 1}
    array names a
 }
 proc bigramMatch {string1 string2 {scale 100}} {
    set b1 [bigrams $string1]
    set b2 [bigrams $string2]
    expr {$scale * [llength [and $b1 $b2]]/[llength [or $b1 $b2]]}
 }
 proc weightBigrams {word results {threshold 6}} {
    set res {}
    set b1 [bigrams $word]
    foreach result $results {
        foreach candidate [lindex $result 0] {
            set sim [bigramMatch $word $candidate]
            if {$sim >= $threshold} {
                lappend res [list $candidate $sim]
            }
        }
    }
    lsort -integer -decreasing -index 1 $res
 }

This one filters much stronger, even with low threshold, but is more sensitive to transpositions:

 % weightBigrams retorn [test retorn]
 {return 55} {error 8} {errorInfo 6}
 % weightBigrams retrun [test retrun]
 {return 40} {unset 18} {entry 8} {counter 7}

A compromise between single characters ("monograms") and bigrams is made with the following variation:

 % monoBigrams monoBigrams
 m o n o B i g r a m s { m} mo on no oB Bi ig gr ra am ms s

 proc monoBigrams string {
    set res [split $string ""]
    set last " "
    foreach char [concat $res {{}}] {
        lappend res $last$char
        set last $char
    }
    set res
 }
 proc mbMatch {string1 string2 {scale 100}} {
    set b1 [monoBigrams $string1]
    set b2 [monoBigrams $string2]
    expr {$scale * [llength [and $b1 $b2]]/[llength [or $b1 $b2]]}
 }

… or how's trigrams (three-character sequences)?

 % trigrams trigrams
 { tr} tri rig igr gra ram ams {ms }

Not so useful, except on very long, hardly disturbed words - on shorter words, substitutions, insertions or deletions disturb the set of trigrams very much.

 proc trigrams string {
    set res {}
    set last " "
    set prev [string index $string 0]
    foreach char [lrange [split "$string " ""] 1 end] {
        lappend res $last$prev$char
        set last $prev
        set prev $char
    }
    set res
 }

Artur Trzewik Here a implementation that use lsort. It is 45% faster as 'or' and 'and' usage in my test. Also modified trigrams proc make some procents improvements. Anyway trigrams are better solution for longer text used for example in translation memories.

  proc stringFuzzyMatch {str1 str2} {
    set l1 [lsort -unique [trigrams2 $str1]]
    set l2 [lsort -unique [trigrams2 $str2]]
    set g 0
    set i1 0
    set i2 0
    set len1 [llength $l1]
    set len2 [llength $l2]
    set t1 [lindex $l1 $i1]
    set t2 [lindex $l2 $i2]
    set l 0
    while {$i1<$len1 && $i2<$len2} {
        incr l
        set r [string compare $t1 $t2]
        # better than switch statement
        if {$r==0} {
            incr i1
            set t1 [lindex $l1 $i1]
            incr i2
            set t2 [lindex $l2 $i2]
            incr g
        } elseif {$r==-1} {
            incr i1
            set t1 [lindex $l1 $i1]
        } else {
            incr i2
            set t2 [lindex $l2 $i2]
        }
    }
    expr {100*$g/$l}
  }
  proc trigrams2 string {
    set rez [list]
    set str " $string "
    set l [expr {[string length $str]-2}]
    for {set x 0} {$x<$l} {incr x} {
        lappend rez [string range $str $x [expr {$x+2}]]
    }
    return $rez
  }

After taking look into Python difflib, which is in core, I wonder why Tcl have no fuzzy match algorithms in tcllib or in core, but is considered as string computation first choose language by some people.

KBK observes that struct::list has the 'longest common subsequence' operation that is at the heart of many 'fuzzy match' techniques. It's coded for lists, and not strings, but that's because it's trivial to do [split $string {}] to make a list for it, while it would be quite difficult to apply a string-oriented matcher to lists. We've had diff in Tcl for quite a long time.

See also edit