shimmering

Purpose: define what shimmering is, what causes it, why one wants it to occur or wants to avoid it, and how to cause it or avoid it.

"Shimmering" is a Tcl quirk or strength, depending on where you're coming from. It refers to the fact that internally, Tcl can keep two representations for each value. One of them is always a string, the other is usually a fast equivalent for it, i.e. ints, floats, and lists.

Donald Arseneau: "Shimmering ... is repetitive changing of the internal representation for some data. It has no effect on the program's results, but can slow it down dramatically." CL notes that a few unusual extensions seem to have the ability to break Tcl's EIAS semantics, and then shimmering becomes functionally visible in an otherwise pure-Tcl application. tcom, for example, takes "hints" from the internal representations when converting Tcl values to native machine values.

When you do "set x {a b c}", you store a 5-character string in x.

When you then do "lappend x d", you will end up with a 4-item list. Which, as far as you're normally concerned, is simply "a b c d".

And it is! - but Tcl will play a clever trick, and convert the string to an efficient internal representation of a list of things.

It's all hidden. When you do "puts $x", you get the result you expect. What happens is that puts wants a string, Tcl sees it has none any more, and then creates a proper string for you, on-the-fly.

At this time, you may decide to do "lappend x e". Tcl will happily detect that a list is there and quickly append. As a crucial side-effect, it will also discard the 7-character string, which is no longer appropriate. Keep in mind that no new string gets created at this point.

The point of all this is speed (lots of it). But conceptually, scripts can be built without caring one shred about this duality.

Until "shimmering" sets in... this is used to describe the effect that Tcl continuously alternates between creating a string representation, discarding the other one, and going back to the underlying one and discarding the string.

Here's a non-shimmering loop (no string operations at all):

  for {set i 0} {$i < 10000} {incr i} { lappend x $i }

Here's one which shimmers a bit (the list is never lost):

  for {set i 0} {$i < 10000} {incr i} { lappend x [string bytelength $x] }

This one shimmers badly (a list, a string, a list, ...):

  for {set i 0} {$i < 10000} {incr i} { lappend x $i; append x . }

This one also shimmers badly (currently), but for less obvious reasons (a side-effect of string length is that an internal representation is built which would make string index efficient):

  for {set i 0} {$i < 10000} {incr i} { lappend x [string length $x] }

Timing comparison left as exercise for the reader... You may want to start out with 1000 as upper limit instead of 10000.

All conversions in tcl currently go through the string representation. Consider:

        set i [expr {$n+1}]
        puts [llength $i]

Here, "i" will be an int first, which then needs to be turned into a list. To do so, a string is constructed, parsed, and converted to a one-item list.

Which we know is doing too much - an int cannot be anything but a one-item list. This was a contrived case, but now compare the two below:

        uplevel 1 [list myproc $arg]
        uplevel [list myproc $arg]

Trouble (a bit: only in terms of trying to achieve top performance).

How about introducing a "typecast matrix"? With a growing, perhaps dynamically extensible, set of smart conversions for special cases? If a converter is present it gets called. If it fails or there is none, then conversion progresses as usual - through an intermediate string rep.

Note that the uplevel example above requires yet more smarts. It's a list of two items, hence it cannot possibly be an int - no need to convert, fail, and have lost the list in the process.

A thought for Tcl9 perhaps? -jcw

As to the specific case jcw cites - a list of more than one element can never be successfully converted to an int ... would it not suffice to build that peephole optimisation into the C function which converts arbitrary objects to ints? In other words, rather than an extensible table of smart convertors (which seems to me would be pretty sparse), use the C code itself to implement short circuits for failure. CMCc

DGP Along these lines, see Tcl Patch 738900 [1].

AMG: I have a project to unify the internal representations for list and dict. However, I haven't touched this project in quite a while, since I've been massively busy with other stuff. Anyway, this unification would avoid shimmering in the case of using list methods to access dicts, and it would optimize the generation of the string representation of a dict. Here's a simple example of a case that would be improved:

set data [dict create {a 1 b 2}] ;# must have a dict intrep
foreach {key val} $data {...}    ;# normally should use [dict for]

Of course, you could just use dict for. But that is less flexible than general foreach.

A long-term goal of the project is to create a new data type for sets, which I think I'll have to call rings because set is very much taken. :^) The difference between a dict and a ring is that a dict has a value for each key, whereas a ring has only keys. Both would be unified with lists, in that they're really just indexes over the list intrep. Of course, a dict can currently be used to implement sets; just set all the values to be empty string. But it takes positive effort to generate such a data structure from a list of keys. If I get my way, the list of keys would be the ring.

One nice side effect of my approach is that the indexing data could be made available at the script level. This would make it possible to find which list index corresponds to a given key without having to do a search. An example use is to look up the column number corresponding to a column name in a tabular data structure.

set fields {first middle last url}
set data {
    {Donald G Porter http://math.nist.gov/~DPorter/contact/}
    {Robert L Hicks http://wiki.tcl.tk/11367}
    {Andy M Goth http://andy.junkdrome.org/}
}
foreach $fields [concat {*}[lsort -index [ring index $fields last] $data]] {
    puts [format "> %-24s : %s" "$last, $first $middle" $url]
}
> Goth, Andy M             : http://andy.junkdrome.org/
> Hicks, Robert L          : http://wiki.tcl.tk/11367
> Porter, Donald G         : http://math.nist.gov/~DPorter/contact/

Back to shimmering. This would create a new, lesser type of shimmering. Rather than blowing away the original list intrep in order to make the dict, the object type would remain list and a dict index would be generated over it. So long as the Tcl_Obj is modified using only dict accessors, the dict index would remain valid. Using lappend would wipe out the index, but using llength would not.

AK - 2010-06-29 17:47:54

Regarding sets/rings, have a look at the struct::set package in Tcllib. This package not only has the regular Tcl implementation, but also a C implementation based on Critcl, which in essence creates a Tcl_ObjType. It uses a Tcl_HashTable to make the set operations fast (i.e. quick check for existence of set elements, at the core of stuff like intersection and difference, etc.).

A general note: From time to time multi-intrep musings seem come up as an alternative to unified data structures for lists, dicts, and sets, i.e. instead of having string and intrep a Tcl_Obj is allowed to have string plus multiple intreps at the same time. I do not remember anybody having worked out all the quirks and corner cases of such a system. Although I seem to remember that FB's Colibri/Cloverfield is an attempt of doing that (plus ropes for strings).

Another note, IIRC some of the builtin commands, like foreach, have been modified to work with dicts as well, without converting them to list. This however is done on command-by-command basis by checking for the various intrep types in the command.

jcw - 2010-06-29 21:49:24

Re dicts, lists, sets...

There's a way to generalize it way further, IMO - without losing the benefits of each of them. I've written about this a few times in the past, but ironically I can't find any good reference to that anymore. Maybe it's just a bad idea that has gone into hiding again? Let me try to reconstruct it again here.

Consider every data structure along the following three aspects:

the number of key fields (K: 0..up) - keys also determine ordering
whether the key fields are unique or can be repeated (U: 0..1)
the number of residual fields (R: 0..up)

There are 8 combinations, if you separate K and R in 0 and >0 cases. The most obvious ones are:

K = 0, U = 0, R = 1 is a list
K = 1, U = 1, R = 1 is a dict
K = 1, U = 1, R = 0 is a set

The other cases are a bit harder to nail down:

K = 0, U = 0, R = 0 is a count (no content, but any number of them, i.e. a zero-column table)
K = 0, U = 1, R = 0 is a boolean (no content, but you can have either 0 or 1 of 'em)
K = 0, U = 1, R = 1 is a optional value (either one value or no value)
K = 1, U = 0, R = 0 is a multi-set (like a set, but each value can occur more than once)
K = 1, U = 0, R = 1 is a multi-dict (like a dict, but each value can occur more than once)

What does this all have to do with lists, dicts, sets, and shimmering? Well, I have a hunch that they could all be nailed down with a single data structure. Such a data structure would store the following information:

the values for K, U, and R
a sized array of Tcl_Obj pointers
a hash index (unused if K is 0)

The access is by row/column, mapped into the array in row-major order. The number of columns is K + R. The number of rows is <array-size> / (K + R). In the simplest case, K and R need only be 0 or 1, but there is no reason why this constraint couldn't be dropped, allowing for arbitrary rectangular table shapes.

Again, what's the relevance for shimmering? Well, if you have a single universal data structure, then shimmering can be avoided. To switch from a list of pairs to a dict, change K from 0 to 1. That changes a list to a multi-dict (there's bit more to it, since dicts don't enforce ordering, but I'll ignore that for now). All you need to do is generate the hash index.

I'm fudging lots of details, and probably sweeping some problems under the carpet. But the gist is that the difference between lists, dicts, sets, and tables could be treated as changes in interpretation of the elements of the array positions, without moving anything around. Plus a hash index which gets created and dropped as needed to match the current interpretation. That's a different kind of shimmering, and one which could perhaps be lazily adjusted to avoid a lot of work. So that when you access a dict as a list, the index is kept but tagged as inactive, and then when you access the list as dict again, the index could be reactivated if the list contents hasn't changed. Any change to the value while it is a list would cause the index to be jettisoned.

I also think that this approach could support efficient tabular structures at no extra cost (just allow K and R to be anything >= 0). Being values, these things can be nested. And with optional values (i.e. a list with 0 or 1 elements), the nesting might even allow representing tables with ... NULL support).

My hunch is that K, U, and R should probably not be accessible at the Tcl level. Just like you can't access the type of a Tcl_Obj* from Tcl. The representation would be adjusted internally, based on how the value is accessed and manipulated. Just like it is done today with lists vs. dicts.

Phew. I hope this makes some sense.

AMG: To me at least, it only makes sense to create a hash table index over the keys. With K=0, what is there to index? Finding the index of a row is done simply by multiplying R by the row number, something that can be done perfectly well in script right now (e.g. foreach with first argument list length = R).

How would K>1 be indexed? Would each key be indexed separately? If so, how would this be accessed from script level? The only thing I can think of is again something that's currently possible and efficient:

foreach word {I do not know where family doctors acquired illegibly
perplexing handwriting nevertheless extraordinary pharmaceutical
intellectuality counterbalancing indecipherability transcendentalizes
intercommunications incomprehensibleness} {
    dict set index length [string length $word] $word
    dict set index hash [expr {[string length $word] * [regexp -all -nocase {[aeiou]} $word]}] $word
}
puts $index ;# pretty-printed below
length {1 I 2 do 3 not 4 know 5 where 6 family 7 doctors 8 acquired\
    9 illegibly 10 perplexing 11 handwriting 12 nevertheless 13 extraordinary\
    14 pharmaceutical 15 intellectuality 16 counterbalancing 17 indecipherability\
    18 transcendentalizes 19 intercommunications 20 incomprehensibleness}\
hash {1 I 2 do 3 not 4 know 10 where 12 family 14 doctors 32 acquired\
    27 illegibly 30 perplexing 33 handwriting 48 nevertheless 65 extraordinary\
    84 pharmaceutical 90 intellectuality 96 counterbalancing 119 indecipherability\
    108 transcendentalizes 152 intercommunications 140 incomprehensibleness}

Pardon the goofy example. I'm having a hard time thinking of multiple sets of unique keys that could map to common data. The point is that I have constructed a single dict that effectively has K=2 and R=1. Even though it looks like there are two copies of the data, the Tcl_Objs are shared. Querying by either key is done by accessing the sub-dict named after that key.

Or maybe K>1 means that the concatenation of the keys is to be indexed. This can be done by putting the keys together in a list. Note that the lists will shimmer to string before they can be indexed. (The ints from the previous example also shimmer to string.)

foreach word {...} {
    dict set index [list $key1 $key2] $word
}

As for R>1, put the values together into a list. R=0? Use empty string for the value.

U=0 (duplicate keys allowed) can only be done with lists, not with dicts. Wait, actually that's not true. dicts can be used; just make each key map to the list of values that are associated with that key.

As far as I can tell, your proposal succeeds only in eliminating the extra list Tcl_Objs currently required for putting K>1 and R>1 into a dict. Flattening the data structure could reduce shimmering, but at what cost? Two of the structures you mentioned (count, boolean) are implemented more efficiently with shimmering: just make them ints stored directly in the Tcl_Obj, with no need to point to a List or whatever struct. How would the optional value appear at the script level? See null for attempts to grapple with this problem. There are currently several ways, and they're all neatly defined at the script level; e.g. [info exists varname] (0=undefined, 1=value is in $varname) or [llength $varname] (0=undefined, 1=value is in [lindex $varname 0]).

While I may be opposed to implementing the full generality of your proposal, I was planning on putting a field in my hash-index-over-list data structure that gives the R value, but I would only allow it to be zero or one. The existence or nonexistence of the hash table would correspond to K=1 and K=0, respectively. I would have U=1 hard-coded (uniqueness required), because script can build capacity for duplicates on top of it without the need to complicate the existing [dict] command.

Don't get me wrong, what you have to say is very interesting and valuable at the theoretical level; it's a great way to categorize data structures. I just don't think it's practical, not in Tcl at least.

jcw - With K, U, and R not visible at the Tcl level, not all variants need to be implemented for this proposal to work (since you can't tell from the outside anyway). Cases such as K > 1 cannot "happen" with the current set of Tcl commands. A new "rtable" command could be created which provides access (i.e. adjusts K/U/R to specific values). For example, "rtable t1 first last : age shoesize country" could create a "t1" command with K set to 2, U set to 1 (due to the use ":" iso of say "/") and R set to 3. Then, "t1" becomes something very similar to the "dict" command, but specialized for a specific K/U/R mode and with named fields. This would allow things such as "t1 size $value" and "t1 set myvar John Doe 12 7 US". I'm making this up as I go along, so don't read too much into it, but I think something like this could be made to fit Tcl's concepts very nicely. With common cases such as "rtable mylist / value" (≈ a list) and "rtable mydict key : value" (≈ a dict).

DGP I would need to examine this more carefully to determine whether or not I think there is a good "common implementation" idea here. However, even without converting any implementations, I like that this discussion aims at framing a common language for talking about the various kinds of "containers" in Tcl. One of the few good ideas floating about for adding actual new syntax to the Tcl language is aimed at prettier ways to pull element values from container values held in variables, and it appears this conceptual framework could be a useful guide in that discussion.

AMG: JCW, your comments suggest that you intend for the concatenation of the key field to be unique, rather than each key field having to be unique on its own. For example, there can only be one "John Doe" ([list $first last]), but there can also be "John Smith", "Jane Doe", and "Elton John", only one each. I wasn't clear on that at first.

AK - 2010-06-30 11:00:27

Hi jcw, nice to see you here (again) :).

Regarding your data structure categorization I seem to remember to have seen it in some documents about and/or relating to your VLERQ project.

AMG: JCW, your discussion of R>1 residual fields reminds me of my new lcomp, which allows multiple generated elements per combination of inputs.

lcomp {$key} {$val} for key in {a b c} and val in {1 2 3}          ;# returns "a 1 b 2 c 3"
lcomp {$word} {[string length $word]} for word in {welcome to tcl} ;# returns "welcome 7 to 2 tcl 3"

See Also edit