AMG Please bear with my silly mood and my overlong narrative. We'll fix it later. :^) I think I'll move the "story" elements to a new page, leave the sort algorithms here, and move the array-based set stuff to a page on sets.

Argument

I needed a set data type with the following properties, above and beyond those of basic lists:

  • Order is unimportant. {bravo delta charlie alpha} == {alpha delta bravo charlie}.
  • An element is either in the set, or it isn't. The frequency of any given element is either zero or one.

I needed the following operations:
insertion
After inserting an element, all that is guaranteed is that it will be present in the set. It won't necessarily be at the end of the set, and if it was already present then the set will not be modified. In my program I expected that relatively few insertions would result in new elements being added.
membership
This operation returns true if and only if the element was previously inserted, i.e. is present in the set.
iteration
I'm not sure this is the name I should be using, but here I mean the ability to get a list of all elements suitable for use with [foreach].

I didn't need to worry about removing elements, doing operations involving multiple sets (union, intersection, symmetric difference, subtraction, comparison, etc.), or generating power sets or anything like that.

Book 1 - Lists, or, the Author goes on at Length

So I thought the best way of doing it was with a list that always remained sorted. Why? Because a sorted list makes the membership check very fast. By the way, I used the variable name lst rather than set to keep Vim from erroneously highlighting it as a keyword. Note also that lst holds the name of a variable in the caller's stack frame, not the set itself.
 proc scontains {lst elem} {
     upvar 1 $lst var
     return [expr {[lsearch -exact -sorted $var $elem] != -1}]
 }

This gives a nice logarithmic-time means of checking whether or not an element is in a set. For example, determining whether or not a 1000-element set contains any given element takes at most 10 checks. For comparison's sake, using an unsorted list to represent a set can require up to 1000 checks for this same situation.
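For instance, a quick interactive check (any list works here, as long as it is kept sorted, which is the invariant all of these procs maintain):
 % set s {alpha bravo charlie delta}
 alpha bravo charlie delta
 % scontains s bravo
 1
 % scontains s echo
 0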

And I might as well cover "iteration" while I'm laying the groundwork. A set, being represented as a list, can be used directly as a list so long as the operation is read-only. Nevertheless, I'll wrap it in a proc.
 proc slist {lst} {
     upvar 1 $lst var
     return $var
 }

This returns a list of every element present in the set.
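Continuing the example above:
 % foreach elem [slist s] {puts $elem}
 alpha
 bravo
 charlie
 delta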

Chapter 1 - In which Insertion is Developed

Next, insertion. But first I'll state that an empty set is represented as {}. I'll wrap it in a proc for the sake of completeness.
 proc screate {lst} {
     upvar 1 $lst var
     set var {}
 }

Nonempty sets can be built up from there using insertions. Here's the simplest possible implementation:
 proc sinsert_simple {lst elem} {
     upvar 1 $lst var
     set var [lsort -unique [lappend var $elem]]
 }

This works by first appending $elem to the end of the set list, then re-sorting the list while removing any duplicates. But this is rather wasteful because there is at most one duplicate (the newly-inserted $elem), yet this fact isn't communicated to [lsort] so it has no choice but to check the whole list for duplicates.
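Here's how it behaves at the prompt; note that inserting a duplicate leaves the set unchanged, as promised:
 % screate s
 % sinsert_simple s bravo
 bravo
 % sinsert_simple s alpha
 alpha bravo
 % sinsert_simple s bravo
 alpha bravo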

Before going any further, let's set up a benchmarking system to measure performance. I'll be running these tests on a 486 running at 50MHz. If your computer is slower than this, don't worry, Tcl will still run. :^)
 proc test {func} {
     screate lst
     puts [time {$func lst [expr {int(rand() * 1000)}]} 1000]
 }

This inserts 1000 random numbers into a set and prints the average insertion time. The numbers are themselves in the range 0 ... 999, so there will be a fair number of duplicates. (It might be an interesting probability problem to determine the distribution of the lengths of the sets after such a test; the expected length, at least, works out to 1000*(1 - (1 - 1/1000)^1000), or about 632 elements, so roughly 37% of the insertions are duplicates.) Now let's try it with [sinsert_simple].
 % test sinsert_simple
 20445 microseconds per iteration

Yes, it really did take 20 seconds to run that test. Maybe it will go faster on computers of the future. Why, in another decade we might even be able to get it down to 10 seconds! :^)

Chapter 2 - Insertion gains in Sophistication

Next revision: Armed with the knowledge that the list representation need not change if $elem is already in the set, and that adding a new element never creates a duplicate, I can write:
 proc sinsert_checkmemb {lst elem} {
     upvar 1 $lst var
     if {![scontains var $elem]} {
         set var [lsort [lappend var $elem]]
     }
     return $var
 }

I'm guessing that in my test about 40% of the insertions are duplicates, so the time should go down to 12267 microseconds or so. Let's check.
 % test sinsert_checkmemb
 12098 microseconds per iteration

That's a pretty big improvement. Consistent with my 40% guess (okay, it wasn't a guess, I checked [llength $lst]), it takes 40.8% less time to run [sinsert_checkmemb] than [sinsert_simple].

Chapter 3 - In which the Author milks Insertion for all it's Worth

But there are more possibilities in store. Since there's at most one element to insert into a list that's already sorted, why re-sort the entire list? This next proc ditches [lsort] in favor of a binary search to determine at which point in the list the new element needs to be inserted (or if it is already present).
 proc sinsert_bsearch {lst elem} {
     upvar 1 $lst var

     # Binary search for the position $elem should occupy.
     set left  0
     set right [llength $var]

     while {$left < $right} {
         set idx [expr {($left + $right) / 2}]
         set dat [lindex $var $idx]

         # [string compare] returns -1, 0, or 1.
         switch -- [string compare $dat $elem] {
         -1 {set left [expr {$idx + 1}]}
          0 {return $var               }
          1 {set right $idx            }
         }
     }

     # Not found; $left is now the correct insertion point.
     set var [linsert $var $left $elem]
     return $var
 }

I was initially hesitant about coding so much in Tcl, since the C-implemented [lsort] "should" be faster due to being compiled machine code, but the benchmark shows that algorithmic improvements give better results than switching from bytecode to machine code.
 % test sinsert_bsearch
 7011 microseconds per iteration

See, rather than wait a decade (okay, okay, 18 months) for hardware technology to double my program's speed, I could just drop in a better algorithm and get the same effect in no time at all.

Chapter 4 - In which the Author admits his Lie

But I've been rather irresponsible in my benchmarks. Going to a higher-performance class of algorithm (e.g. from quadratic time to linear time to logarithmic time to constant time) always gives a speed boost for high n. (The n refers to the letter in [Big O] notation: O(n log n), for example.) But it also tends to increase coding complexity and setup time. In fact, it's quite common for an unsophisticated algorithm to outperform a fancy algorithm when the actual data has low values of n. So to even things out, let's first redo the benchmarks for various values of n rather than just the 1000 used above.
 foreach n {1 5 10 25 50 100 250 500 1000} {
     puts "n = $n"
     foreach func {sinsert_simple sinsert_checkmemb sinsert_bsearch} {
         puts -nonewline "   $func: "; flush stdout
         set sum 0
         for {set i 0} {$i < 5} {incr i} {
             screate lst
             set dur [time {$func lst [expr {int(rand() * $n)}]} $n]
             incr sum [lindex $dur 0]
         }
         puts [expr {$sum / 5}]
     }
 }

And here's the result, manually generated from the output of the above [foreach] loop:
    n  simple  checkmemb  bsearch

    1    1611       2442     1428
    5    1662       2282     2327
   10    1736       2330     2743
   25    1938       2578     3314
   50    2337       2653     3844
  100    3076       3070     4476
  250    5771       4439     5481
  500   10487       7234     6317
 1000   20820      12650     7283

As you can see, [sinsert_bsearch] loses all the way up to around n=300, but from there on out its increased overhead pays for itself. [sinsert_simple] works fine until n=100, at which point [sinsert_checkmemb] starts to work better. The finished [sinsert] proc can determine which algorithm to use based on the list length, but first verify that the overhead of the [if] doesn't overwhelm your optimization.
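Such a dispatcher might look like the following sketch. [sinsert_hybrid] is a name introduced just for illustration, and the crossover point of 300 merely approximates what the table above suggests; benchmark with your own data before trusting it.
 proc sinsert_hybrid {lst elem} {
     upvar 1 $lst var
     # 300 approximates the crossover seen in the table above;
     # tune it against real data.
     if {[llength $var] < 300} {
         sinsert_checkmemb var $elem
     } else {
         sinsert_bsearch var $elem
     }
     return $var
 }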

I should also mention that the above table entries and the tradeoff points are all dependent on the nature of the data, specifically the ratio of unique insertions to duplicate insertions. So if you want to use any of this code in your own application, you should benchmark using actual data rather than blindly trusting the above.

Important rule of thumb: n is always small. Look before you optimize! I didn't. Which brings me to the next book in my tale of coding and woe.

Book 2 - Arrays, or, in which the Author realizes he is an Idiot

You may have noticed the use of past tense in the above narrative. I have since repented, deciding that a sorted list is not, in fact, the best way to represent a set. Per Joe English's casual suggestion (I hope I have that attribution correct), I tried using arrays. The results were both astonishing and humbling. Mostly humbling!

Chapter 1 - In which the Paradigm is Shifted

So I switched from using lists to arrays to represent my sets. For the sake of this Narrative we don't care about the value associated with any particular key in the array, so we'll just use "". But know that in your own program you're free to abuse arrays any way you wish.

Any given element can either be or not be in any given set. Any given key can either be or not be in any given array. See the parallel? The properties of sets map much more closely to arrays than they do to lists, wherein we have to do special things to avoid duplicate elements and discourage dependence on order.

First we need a way of creating a new, empty set. The following will do:
 proc screate {arr} {
     upvar 1 $arr var
     array set var {}
 }

And here's the array version of [scontains]. I'm changing my set variable name from lst to arr. Again, I'd use the name set directly if not for Vim.
 proc scontains {arr elem} {
     upvar 1 $arr var
     return [info exists var($elem)]
 }

Notice how easy that was. All the searching has moved from my script to inside the Tcl core, so this code will automatically take advantage of all future algorithmic breakthroughs in the field of fast array lookup and subsequent improvements to the Tcl array implementation.
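For example, poking an element into the array by hand (a proper [sinsert] arrives in the next chapter):
 % screate arr
 % set arr(alpha) ""
 % scontains arr alpha
 1
 % scontains arr bravo
 0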

It's a little more tricky to get a list of all elements in a set, now that the set is no longer itself a list. But it's definitely doable:
 proc slist {arr} {
     upvar 1 $arr var
     return [array names var]
 }

The returned list is by no means sorted, but sort order isn't supposed to matter anyway. If you need it sorted, just apply [lsort] afterwards. Of course this particular proc is less efficient than its list-based counterpart from Book 1, but it's still workable.

Chapter 2 - In which the Difficult becomes Easy (and fast, too!)

Now for set insertion. If you will recall, this occupied the bulk of the discussion above. That's actually a good thing, because this page is supposed to be about sorted lists, not arrays, so I'll leave all that claptrap where it is in case somebody still wants a sorted list.
 proc sinsert {arr elem} {
     upvar 1 $arr var
     set var($elem) ""
 }

Don't blink or you'll miss it! This code is now very short and simple. And as it turns out, very fast. Here's proof:
 foreach n {1 5 10 25 50 100 250 500 1000} {
     set sum 0
     for {set i 0} {$i < 5} {incr i} {
         screate arr
         set dur [time {sinsert arr [expr {int(rand() * $n)}]} $n]
         incr sum [lindex $dur 0]
     }
     puts "n = $n : [expr {$sum / 5}]"
 }

Adding the results to the previous table yields absolutely marvelous results:
    n  simple  checkmemb  bsearch  array

    1    1611       2442     1428   1561
    5    1662       2282     2327   1158
   10    1736       2330     2743   1158
   25    1938       2578     3314   1162
   50    2337       2653     3844   1160
  100    3076       3070     4476   1196
  250    5771       4439     5481   1164
  500   10487       7234     6317   1163
 1000   20820      12650     7283   1168

We appear to have gone to constant time. Stunning.

I'm sure there's a hidden cost. Most likely the array representation of sets uses more memory than the list representation. I don't have what I need to check the memory usage, though, so I'll leave that for someone else. Honestly, I don't really even care; if memory usage were crucial, I'd just use a slower algorithm.

Fin.

AMG I just discovered a shortcoming of the array-based set procedures. With them it is impossible to store sets inside arrays. This is fatal for the program I'm writing now. But that's not so bad since said program has n of up to, uhm, 50, and at that point performance is quite acceptable using the basic [lappend]/[lsort -unique] method.

AM Beware of code outside procedures: such code is not compiled and hence messes up the timings. I suspect that with the various versions inside a procedure the timing results will be very different!

AMG Okay, I redid the benchmarks using a proc. (Whereas above I did the table formatting manually this time I got smart and used MAK's [listToTable], found in Additional list functions. I modified it a bit to do right justification.)
    n  simple  checkmemb  bsearch  array

    1    1467       4322     1541   1890
    5    2136       2202     2678   1345
   10    2095       2679     3306   1298
   25    2526       2275     4172   1272
   50    3014       2041     4116   1047
  100    4762       2196     4856   1122
  250   10774       2931     5793   1051
  500   21721       4263     6348   1063
 1000   44843       7075     7122   1056

Yeah, that is different. Thanks for the tip. I can see more clearly now that using Tcl's built-in binary search (inside the implementation of [lsearch]) wins over my script-based binary search. But I think that if insertions were more frequent then [sinsert_bsearch] would outperform [sinsert_checkmemb]. Again, relative performance of different algorithms is often very data-dependent.

If the [lsearch -sorted] command could be made to return the index where an element should be even if not actually present in the list, that would give the best performance of all.
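Tcl 8.6 later added just such a facility as [lsearch]'s -bisect option, which returns the index of the last element less than or equal to the search value, or -1 if the value sorts before everything in the list. A minimal sketch of an insertion proc built on it, assuming 8.6 or newer:
 proc sinsert_bisect {lst elem} {
     upvar 1 $lst var
     # -bisect: index of the last element <= $elem, or -1.
     set idx [lsearch -bisect $var $elem]
     if {$idx >= 0 && [lindex $var $idx] eq $elem} {
         return $var    ;# already present
     }
     set var [linsert $var [expr {$idx + 1}] $elem]
     return $var
 }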

schlenk: In tcllib, prioqueue and skiplist provide some 'sorted list' kind of functionality.

NEM: With regard to the drawback of the array-based implementation, have you tried using dicts? They're in 8.5 and there is a package available which provides them for 8.4 too.
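A minimal sketch of a dict-based set with the same interface as the array version (Tcl 8.5, or 8.4 with the dict package):
 proc screate {d} {
     upvar 1 $d var
     set var [dict create]
 }

 proc sinsert {d elem} {
     upvar 1 $d var
     dict set var $elem ""
 }

 proc scontains {d elem} {
     upvar 1 $d var
     return [dict exists $var $elem]
 }

 proc slist {d} {
     upvar 1 $d var
     return [dict keys $var]
 }

Since a dict is an ordinary value, a set represented this way can itself be stored inside an array or another dict, which sidesteps the shortcoming noted above.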

AMG: If all set access is performed through procedures ([screate], [sinsert], [slist], etc.), it's quite possible for the set implementation to store the set data in an internal namespace. For example:
 namespace eval ::set {
     variable next_id 0
 }

 proc ::set::create {} {
     set id $::set::next_id
     array set ::set::set_$id {}
     incr ::set::next_id
     return $id
 }

 proc ::set::include {id args} {
     foreach elem $args {
         set ::set::set_$id\($elem) ""
     }
 }

 proc ::set::contents {id} {
     return [array names ::set::set_$id]
 }

 % set quux [set::create]
 % set::include $quux alpha bravo charlie delta
 % set::contents $quux
 charlie delta bravo alpha

This allows client code to treat sets as ordinary (if opaque) strings yet still takes advantage of Tcl's array system.

AK: Regarding tcllib, it has set operations, under struct. Called struct::set.

AMG: I see [struct::set] makes no attempt at sorting or any other such "optimization" of dubious benefit. This is probably for the best: since it has no clue about the nature of the data it is given, it cannot hope to select the proper algorithm, so it falls back on the simple approach.
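For reference, a short sketch of the same operations using struct::set, if I'm reading the tcllib documentation right:
 package require struct::set

 set s {}
 ::struct::set include s alpha              ;# add an element to the set in variable s
 ::struct::set include s alpha              ;# inserting a duplicate is a no-op
 puts [::struct::set contains $s alpha]     ;# prints 1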


I'm half-tempted to also put this in Category Book. :^)