AMG: I guess
RS wanted this page, so here goes... :^)
RS: Oops, sorry, what I meant was
Shuffle a list ...
AMG: Ah, well, at least I got to write another un/sorting story page.
This page discusses shuffling algorithms and implementations thereof.
Version 1: sorting random reals edit
Here's a proc, adapted from
unsort:
proc shuffle {data} {
set rand_data [list]
foreach elem $data {
lappend rand_data [list [expr {rand()}] $elem]
}
set result [list]
foreach elem [lsort -real -index 0 $rand_data] {
lappend result [lindex $elem 1]
}
return $result
}
It works, but it's seems suboptimal.
Version 2: sorting random integers edit
For starters, the real sort bothers me. I imagine sorting by integer would be more efficient, but then again producing random integers is (currently) slower than producing random reals. Change rand() for int(0x7fffffff * rand()) and -real for -integer and it winds up taking longer. See the benchmark below. (By the way, how am I supposed to portably get the maximum and minimum integer?)
Next, having up to three copies of the list (one of which is augmented) appears to be a waste of RAM.
Version 3: in-place swaps edit
So would it be better to do an in-place shuffle? For each element of $data it could swap with another, randomly-selected element of $data. I'll take a shot at it:
proc shuffle {data} {
set length [llength $data]
for {set idx_1 0} {$idx_1 < $length} {incr idx_1} {
set idx_2 [expr {int($length * rand())}]
set temp [lindex $data $idx_1]
lset data $idx_1 [lindex $data $idx_2]
lset data $idx_2 $temp
}
return $data
}
Big improvement.
Benchmarking 1, 2, and 3 edit
Here's my magical test data:
set data [lrepeat 100000 x y z z y]
If you're trying this at home from your interactive tclsh, you might want to type "; puts -nonewline {}" after the end of the above line or else its printed result will scroll for a very long time.
Now I
time the shuffles:
..................................................
: : 400 MHz : 1500 MHz :
:....................:.............:.............:
: 1, random reals : 21.847024 s : 9.541117 s :
: 2, random integers : 24.649960 s : 9.857447 s :
: 3, in-place swaps : 7.650484 s : 2.328508 s :
:....................:.............:.............:
Wow, I think I will go with #3! Hold on while I update
unsort... :^)
Version 4: in-place swaps, better random behavior edit
AMG: Per
Lars H's suggestion, I improved the code again. As Lars explains in
unsort, the old code doesn't have a perfectly uniform distribution because, for any given list, the size of the set of permutations from which it chooses is not a multiple of the size of the set of all possible permutations. This is similar to doing "random() % 3" in C and getting 0's more often than 2's.
proc shuffle {data} {
set length [llength $data]
for {} {$length > 1} {incr length -1} {
set idx_1 [expr {$length - 1}]
set idx_2 [expr {int($length * rand())}]
set temp [lindex $data $idx_1]
lset data $idx_1 [lindex $data $idx_2]
lset data $idx_2 $temp
}
return $data
}
That ought to do it! However, it's (theoretically) a little bit slower due to modifying more variables in the inner loop. I also made a less-readable version that
incr's idx_1 rather than recalculating it with
expr, for comparison's sake. Here are the numbers, using the same benchmark.
..................................................
: : 400 MHz : 1500 MHz :
:....................:.............:.............:
: 4 , more random : 8.428660 s : 2.286190 s :
: 4a, less readable : 7.824165 s : 2.163990 s :
:....................:.............:.............:
For reasonably-sized lists (under 500,000 elements, I guess), the speed difference between #3, #4, and #4a is negligible. So I choose #4, which has a more uniform distribution and doesn't have grody sources due to silly optimizations.
I wonder why the 400 MHz results for #4 and #4a are slower than for #3 yet the 1500 MHz results are the other way around...? I guess this proc is random in more than one way.
Here's another algorithm. Basically you randomly select an element from the input list, remove it from the input list, and place it at the end of the output list. The reason I thought of it is, for
unsort it's not necessary to maintain an output list; simply
puts the elements as they come.
But this algorithm involves a lot of copying---
lset is nice for in-place modifies, but there's no in-place element removal. So it's liable to be slower than #4.