[11:02] Colin I can't even remember writing it, and doubt I could again ... I wonder if I've had a stroke [11:02] Colin Anyway ... since I copied glimpse's design, all good. Let me refresh on details [11:04] Colin Ok. Packed indices are just like glimpse's. So, word + file#. Minimal overhead assuming the set of words is small relative to source file sizes. Like, typical english is 10K words. [11:05] Colin that's on disk. [11:05] Colin It unpacks the disk files into arrays, again keys are words, contents are lists of file numbers. [11:06] Colin Each file is given a unique number, see, its ordinal number in the ordered set of all files being indexed. [11:06] Colin I think I stipulated a maximum of 64K files, hence 16 bits per occurrence. [11:07] Colin so file overhead is strlen(word)+1 + 2*occurrence_count [11:07] Colin Search, once the index is read is constant (hashed) [11:08] Colin cost to unpack it is mostly linear I guess, traverse the index and turn it into a tcl array. [11:08] Colin cost to build the index is proportional to the size indexed, I guess. Uses hash/array to build index. [11:09] Colin fundamentally, I think/hope: linear in textbase size. [11:09] Colin the performance of the thing should be linear and proportional to the size of the stuff searched. [11:10] Colin the size should, for expected text sets (natural language or nearly) be a fraction of the stuff indexed. [11:10] Colin Just like glimpse. [11:12] Colin The *very* clever thing about glimpse is that it realises you can load a file and search for a word very quickly, in ram. [11:12] Colin And so all they need to do is construct the set of all unique words in a set of files, and record in which file each word occurs. [11:13] Colin They give each file a unique number, and so the recording of occurrence is just association of a set of file numbers with a word. [11:13] Colin IIRC, they sort their word->files association, but we don't have to because I load it into an array. [11:14] Colin So they have a second index, being word->word#, then word#->file#+ then file#->filename [11:15] aku so the index is word->file, not word->exact location, and exact location are found later by regular search. smaller index, possibly slower in search, but not too much, as you said. [11:15] Colin Exactly!
Category Application