A generic collection traversal interface

NEM 14 Jan 2006 What's the best way to iterate over the elements of a collection? There are a number of different ways of doing this. The simplest is to provide some way to convert your collection to a list, e.g. by having a string representation which coincides with that of Tcl's lists, or by providing an explicit toList operation. However, while this is a simple approach, it is not the most efficient, particularly when the collection being iterated over is complex or expensive to access (e.g. a database). There are two main alternative methods of performing iteration in an effective manner:

Using iterators/cursors, which are stateful pointers to a 'current' element in a collection;
Using higher-order enumerator functions, such as foreach, map, filter etc.

The article "Towards the best collection traversal interface" by Oleg Kiselyov [1] discusses these alternatives and makes a powerful case for the latter option: that a higher-order enumerator is the best default choice for a collection interface. In particular, the article recommends using a left-fold operation with early termination as the enumerator function. There are a number of reasons for this decision:

Firstly, the actual traversal code is written by the author of the collection, and so can make use of internal knowledge of the collection for efficiency, without exposing any of these details to clients of the collection;
Related to the above point, the enumeration code completely encapsulates the traversal, and so can take care of allocating and releasing any needed resources properly, even in the presence of exceptions;
Writing an enumeration function is generally easier than writing a stateful iterator (e.g., consider a traversal for a complex tree structure such as an AVL tree);
As iterators are usually stateful, they introduce an implicit dependency among all expressions that use the iterator, whereas an enumeration function hides its state from clients;
Finally, iterators need some out-of-band method of indicating that the end of the collection has been reached (e.g. an isEmpty method, or a special collection empty return code or exception), whereas an enumeration function simply returns when it reaches the end of the collection.

These are all compelling reasons for enumeration functions, and this is the approach that Scheme took to collections in SRFI 44: Collections [2]. However, it should be noted that there are some drawbacks to using enumerator functions. While enumerator functions don't suffer from a problem of how to indicate end-of-collection, they suffer from the complimentary problem of how to indicate early termination of the enumeration: with an iterator, you simply stop using the iterator, but with an enumerator you have to signal to the enumerator function to stop the traversal. The article referenced above proposes using a boolean result indicator at each iteration which says whether to continue or not. In Tcl, however, we already have the break and continue exceptions for exactly this purpose, so the problem is effectively solved without introducing new machinery. The other, more serious, problem of enumeration functions is that it is difficult to enumerate over more than one collection at once in a general way. foreach provides an elegant solution to this, which can probably be generalised for the interface I propose below. However, it is also possible to define a routine that converts any collection enumerator function of the form defined into an iterator. I describe how to do this after presenting the proposal for a general collection traversal interface.

Proposal

The proposal is to use a generalized fold operation as the default enumeration mechanism for any collections defined. A collection can be anything that logically contains some elements, including: data-structures such as the Tcllib graph and tree, OO container classes, database query interfaces, XML documents, general parsers for data-files (where the elements of the container are elements of the syntax, e.g. BibTeX records), and many others. This is deliberately a very general definition, meant to apply to a wide variety of different situations. By defining a standard interface for traversals, we can then define lots of utility functions in terms of this interface which will work efficiently for many different collection types.

The interface proposed is:

 $collection iterate ?options..? proc seed

where 'proc' is a procedure to be called for each element of the collection, and the 'seed' argument is an initial value for a parameter of that proc. The 'proc' argument should name a procedure of the form:

 proc myproc {seed item} { ...; return $newseed }

An alternative syntax is to allow the parameters and body of the procedure to be given as individual arguments, which will be converted into an anonymous procedure and invoked with apply:

 $collection iterate {seed item} { incr seed $item } 0

This second syntax form is optional (typically it won't be supported when operating in Tcl 8.4, due to the lack of apply).

The syntax of the interface resembles that of object oriented extensions to Tcl such as incr Tcl or XOTcl. However, it should also be possible to implement the interface using other means. For instance, many extensions written in C provide a similar interface (e.g. tdom), but even for pure-Tcl extensions it is not too difficult to author. To make it easier to support pure-Tcl extensions, clients of the interface should assume that $collection is a command-prefix (i.e. list) rather than a single command, and should be expanded before being called:

 set result [uplevel #0 $collection iterate ...]
 # Or (safer)
 proc invoke {cmd args} { uplevel #0 $cmd $args }
 set result [invoke $collection iterate ...]

In this way normal ensemble commands can be supported as collection interfaces with little extra machinery. As with other callback arguments, the collection command is always invoked in the global scope.

Anyway, let's put syntax to one side and look now at the proposed semantics. In its basic form, this operation works just like a left-fold operation. The haskell definition of foldl is:

 foldl proc seed []         = seed
 foldl proc seed (x:xs)     = foldl proc (proc seed x) xs
 -- e.g. foldl (+) 0 [1,2,3] --> (((0+1)+2)+3) = 6

In Tcl, that looks like:

 proc foldl {proc seed list} {
     if {[llength $list] == 0} { return $seed } else {
         foldl $proc [invoke $proc $seed [head $list]] [tail $list]
     }
 }
 proc head list { lindex $list 0 }
 proc tail list { lrange $list 1 end }

It runs through each element of the collection from left-to-right (with suitable definitions of 'left' and 'right' for the collection type) and calls the 'proc' procedure with the current item and the seed value. The procedure then performs any calculations and returns a new seed value for the next iteration -- in this way, the seed value becomes accumulated state (e.g. a running total) that can be calculated as we traverse the collection, without relying on external variables. This fold operation is extremely general and powerful. For more information on folds see this tutorial: [3]. In addition to returning the seed values, the iteration procedure can also return a Tcl exception code to indicate early termination of the traversal. These exceptions work with the usual semantics that they have inside loop bodies:

TCL_OK: normal, proceed with next iteration;
TCL_CONTINUE: skip to next iteration with same seed values as last time;
TCL_BREAK: terminate iteration and return current seed values;
other: abort iteration and propagate the exception.

To demonstrate, here is an implementation of the interface for text files (using a TOOT-like implementation):

 proc invoke {cmd args} { uplevel #0 $cmd $args }
 proc exceptiontype {code} {
     set token [lindex {ok error return break continue} $code]
     return [expr {$token eq "" ? $code : $token}]
 }
 proc file_iterate {file proc seed} {
     set fd [open $file]
     foreach line [split [read $fd] \n] {
         set rc [catch {invoke $proc $seed $line} result]
         switch -exact [exceptiontype $rc] {
             ok         { set seed $result }
             break      { break }
             continue   { continue }
             default    {
                 # Cleanup and propagate (needs improving)
                 close $fd
                 return -code $rc -errorinfo $::errorInfo $result
             }
         }
     }
     close $fd
     return $seed
 }
 proc file: {file method args} {
     if {$method eq "iterate"} {
         uplevel 1 [linsert $args 0 file_iterate $file]
     } else {
         uplevel 1 [linsert $args 0 file $method $file]
     }
 }

Having this interface in place allows us to write code like the following:

 proc def {name = args} {
     uplevel 1 [linsert $args 0 interp alias {} $name {}]
 }
 # Demo
 def myfile = file: test.txt
 proc count {total _} { incr total }
 puts "[myfile nativename] contains [myfile iterate count 0] lines"
 proc longest {longest line} {
     if {[string length $line] > [string length $longest]} {
         return $line
     } else {
         return $longest
     }
 }
 puts "The longest line is [myfile iterate longest {}]"

Some benefits of this code become apparent. Firstly, we don't have to worry about opening and closing the file, as that is taken care of for us. We also don't have to worry about cleaning up if an exception occurs in the middle of processing as that is also taken care of for us by the file_iterate implementation. Now, at present this implementation isn't particularly impressive in terms of performance, although it is adequate. But imagine how fast it would be if file_iterate were coded in C. We could do all sorts of tricks, like memory-mapping the file and using optimized code to traverse the lines, avoiding line-ending conversions etc, and just calling back into a byte-compiled Tcl proc as needed. Could we then approximate the speed of wc -l in Counting a million lines? Could be worth a try.

Defining generic operations

Once we have our very general iterate procedure, we can also then define other operations in terms of it. For instance, here are the classic map and filter generalised to arbitrary collections:

 proc map {proc collection} {
     invoke $collection iterate [list map-helper $proc] [list]
 }
 proc map-helper {proc accum item} {
     lappend accum [invoke $proc $item]
 }
 proc filter {proc collection} {
     invoke $collection iterate [list filter-helper $proc] [list]
 }
 proc filter-helper {proc accum item} {
     if {[invoke $proc $item]} {
         lappend accum $item
     }
     return $accum
 }

We can test these on our file iteration example:

 puts "The sizes of each line in [myfile nativename] are:"
 puts [join [map {string length} myfile] \n]
 proc linebigger {size line} { expr {[string length $line]>$size} }
 puts "The lines bigger than 80 chars:"
 puts [join [filter {linebigger 80} myfile] \n]

As you can see, nothing in the definition of map or filter depends on any knowledge of the implementation of file_iterate, or even on the type of the underlying collection: we could write 'iterate' methods for lists, dictionaries, database queries and others and the generic definitions of map and filter will work with all of them without alteration.

Options

The proposed interface also allows for some options to be given to the enumerate method. The full list of options and values accepted could vary from collection to collection, but an initial list of standard options that might be supported is presented here:

-direction left|right: Specifies a traversal direction, allowing for an optional right-fold to be implemented as well. Could also be extended to allow specifying specific traversal orderings like pre- or post- or in-order traversal of graphs/trees.
-endcommand proc: Specifies a callback to call with the final seed value when the collection has been traversed.
-async 0|1: Enables asynchronous operation using the event loop. The iterate method returns immediately and carries on processing in the background using after or some other means (e.g. fileevents) to schedule calls to the iteration procedure. Could return some sort of token that can be used to cancel the traversal.
-progresscommand proc: A separate progress command that is called at regular intervals to indicate how far through the collection the traversal is. This could be used for updating a progress-bar during a long traversal, for instance. Is called with a single amount-done argument, that is dependent on the underlying collection type: e.g., it could be bytes transferred or list element #. The callback is responsible for determining the total expected amount, and how much is left to do.
-errorcommand proc: Callback to handle when the iteration command returns an error (default would be to just return the error again). This would allow for flexible error handling of e.g. parse errors during a traversal. The arguments passed are: errorCode, error message, and errorInfo of the error. If the callback returns a normal result then iteration continues as if a continue had been returned. Otherwise, the exception returned by the callback is propagated. The callback is responsible for any cleanup if needed. This allows the callback to just log and ignore some errors, while propagating others.

We can now give a more complete description of how callback return codes interact with these return codes:

TCL_OK: result is new seed, proceed with next iteration;
TCL_CONTINUE: proceed with next iteration, keeping current seed;
TCL_BREAK: abort iteration, -endcommand is called with the current seed;
other: call -errorcommand with $::errorInfo, the result message, and $::errorInfo. If the exception type is not TCL_ERROR, then these values are filled in with a fabricated error, equivalent to [error "unexpected exception: $code" [list ITERATE UNEXPECTED $code] ""].

There are probably other options that can be thought of. Conventions for common options can evolve over time.

Converting an enumeration into an iterator

This next section is for the curious; feel free to skip it.

As I mentioned earlier, one of the drawbacks of an enumeration method compared to an iterator is that there is no easy way to combine iteration of two separate collections. This is essentially because the enumeration method moves the loop (the traversal code) out of the application code and into the collection code: if you have two collections, then which one gets to be in charge of the enumeration? And how does it access elements of the other collection? foreach manages to do this, but only because it knows that each collection it enumerates is a list and so it knows how to iterate through all of them. A general enumeration scheme cannot make such a guarantee, so it would seem that we are stuck. In the cases where you want to iterate through multiple collections at once, it seems that an iterators-style interface is the only sensible option. Thankfully, there is a general way to convert any enumerator style interface of the sort proposed into an iterator style interface, which will now be demonstrated. In a language with continuations this conversion can be accomplished relatively easily, however Tcl does not have such a feature. Oleg Kiselyov has another article [4] describing how to do the conversion in a language without continuations (in his case, Haskell). I will now show a conversion of that code into Tcl.

The details of the conversion are not entirely straight-forward. In order to allow our traversal to be turned inside-out, we have to introduce a new parameter to the iterate interface which will be used to make recursive calls to enumerate for each iteration. To avoid changing the interface presented above, this new parameter could be introduced as an option -self proc. Unfortunately, the conversion also requires rewriting our enumeration function in a rather inefficient way: using recursion instead of iteration. In Tcl, this makes the resulting code perform quite poorly, so it may be wise for collection authors to hand-code a custom iterators interface rather than relying on this conversion, at least for now. Perhaps if Tcl ever adds continuations, or some way can be found to reduce the overhead of this conversion then it would become more practical. Still, it is worth seeing what the conversion looks like.

I will demonstrate the conversion using an enumeration interface for lists, as they are perhaps the simplest collection. Our iterator interface will actually use streams, which is the nicest way of exposing an iterator as it also manages to encapsulate the state of the traversal. First, then, we will define our stream data-type (i.e., a lazy list):

 namespace eval stream {
    proc delay  args        { return $args }
    proc force  item        { uplevel #0 $item }
    proc nil    {}          { delay error "empty list" }
    proc cons   {head tail} { delay list $head $tail }
    proc head   stream      { lindex [force $stream] 0 }
    proc tail   stream      { force [lindex [force $stream] 1] }
    proc take   {n stream}  {
        if {$n == 0} { return [list] }
        linsert [take [incr n -1] [tail $stream]] 0 \
                [head $stream]
    }
    namespace export {[a-z]*}
    namespace ensemble create ;# 8.5ism!
 }

Here, streams are represented as pairs of unevaluated expressions: the first element when evaluated generates a value, and the second element generates a new stream (the 'tail' of the stream). This is basically a lazy version of Lisp's cons-cells. Now we define our enumeration method for lists. However, as we are adding an extra 'self' parameter, we will call this method "superfold" rather than enumerate.

 proc lsuperfold {self list proc seed} {
     if {[llength $list] == 0} {
         return $seed
     } else {
         set tail [lassign $list head]
         invoke $self $tail $proc [invoke $proc $seed $head]
     }
 }
 proc list: {list method args} {
     uplevel 1 [linsert $args 0 l$method $list]
 }

Now, from this we can derive our original iterate method without the extra 'self' parameter by simply passing it itself (i.e., we find its fix-point, as you would using the Y-combinator: Hot Curry):

 proc literate {list proc seed} {
     lsuperfold literate $list $proc $seed
 }

And, just to demonstrate that this conforms to our original definition, we can use the generic map and filter functions on it as usual:

 def l = list: {1 2 3 4 5 6 7 8 9 10}
 proc + {a b} { expr {$a + $b} }
 puts "add1 to all = [map {+ 1} l]"

But adding the explicit 'self' parameter also allows us interrupt a traversal in the middle and capture its state (essentially we have made the continuation explicit, similar to a technique known as CPS - continuation-passing style). Here, we will capture the enumeration continuation and store it in a stream:

 proc fold2stream {fold collection} {
     invoke $fold [list streamnext $fold] \
            $collection snd [stream nil]
 }
 proc snd {a b} { return $b }
 proc streamnext {fold tail proc seed} {
     stream cons $seed [stream delay fold2stream $fold $tail]
 }

We can now use this to convert our list into a stream:

 def lstream = fold2stream lsuperfold
 set l [list 1 2 3 4 5 6 7 8 9 10]
 set s [lstream $l]
 puts "First 4 = [stream take 4 $s]"

Discussion

NEM: The original article about the Scheme version generalizes the iterate method further by allowing it to take and return multiple seeds, i.e.:

 $collection iterate $proc seed seed...

I initially implemented Tcl versions that way, but in practice I found it was clumsy to use for the common case where you only have 1 seed. For instance, having to wrap return values with: return [list $value] was something it is too easy to overlook. One seed should be sufficient as you can always pack up arbitrary data-structures into it. The only advantage of the multi-seed version is that it takes care of some of the unpacking and checking of these values for you (as they are just normal proc arguments).

Any comments are welcome. I'm a big fan of using folds in this way, as I think it works well with scripting by moving loops into C code (assuming collections implemented in C). The only real problem is that of looping over two collections at once. The automatic conversion works well in Scheme (where call/cc handles most of the details), but in Tcl it could do with some work. If anyone has any good ideas about how to improve it, please share.

At the moment, this proposal is just a suggestion. When it has matured a bit, I wonder if it might be worth writing an Informational TIP or similar to propose some standard interfaces for collections? Something similar to the Scheme SRFI referenced above.

Lars H: What about a more foreach-like enumerator, i.e., the code part gets evaluated in the same stack context as the enumerator was called from? (TclX's scanfile does it that way, and my practical experience of that is that it is pretty useful to have access to various local variables.) It's clear to me how to build a command-calling enumerator from a foreach-like one, but the other way is not clear (you have to know how much to uplevel, which I didn't see specified above). The foreach-like approach also gives a trivial solution to how to pass state from one iteration to the next (the state never goes away), but one might of course argue that that is inherently bad as it is not functional.

The multiplicity of options make me a bit worried (as does the fact that those options come in the middle of the command rather than at the end). It could be better to split some of them off as separate subcommands (e.g. -async on).

If you're interested in specifying interfaces, you might want to take a look at [5]. It would give you a way for collections to tell their users "Yes, I support this kind of enumerator." You might also find the concept of autoadaptors interesting.

NEM: The problem with a foreach-like approach, is that it limits implementation options to some extent. For instance, the -async option would be impossible to implement in Tcl if you had to guarantee that the original stack context would still be around. I'm just in the process of adapting a higher-level channel API to conform to this interface, and async operation is very important there (I haven't even implemented synchronous operation yet). I'm also rather fond of functional solutions if possible. I agree though, that it makes using state harder than with a foreach-approach. Perhaps a compromise can be reached.

Regarding use of variables: I agree, it is very useful to be able to refer to surrounding variables, without some format/string map or other quoting. I'm toying with adding in some of the code from simple closures and objects, which would allow capturing the values of some of these variables when calling the enumerator. Something like:

 set out [socket ...]
 $col iterate {seed item} { puts $out $item } ""

(The seed should probably be optional for these cases). Here the $out variable is read-only, but perhaps that is enough for most uses? One thing I've been thinking of recently is a new option to apply that makes it work a bit like dict with:

 dict set env out [socket ...] ...
 apply -environment env $func $args...

Here, apply creates a new scope, initializes it with the var/value mappings in the environment dictionary and then binds the arguments. When the scope is exited, it does as dict with does, and alters the env variable to contain the new values. This would allow you to do mutable closures in loops, while the lambda itself remains a pure value (i.e., a string). This currently cannot be done entirely correctly with what we have as:

 set body [list dict with __env__ $body]
 set func [list [linsert $params 0 __env__] $body $ns]
 apply $func $env ...

doesn't quite work correctly -- variables from the environment will override parameters!

With this in place, you'd essentially be able to get a foreach-like interface, but using apply (thus byte-compiled), and the context is captured so that you can avoid being tied to synchronous evaluation in the caller's scope. Another question would then be whether to capture all the scope (i.e. all of [uplevel 1 { info locals }]), or to have a statics list, as in Jim to limit what is captured. See the bottom of simple closures and objects for more information and a toy implementation.

You're right about the options -- there are too many. A minimal set would be just -endcommand and -errorcommand. Presence of -endcommand implies -async, and -progresscommand can be implemented as part of the iteration callback by client code (which also knows more about what progress has been made).

Lars H: dict with might be overkill in this case. What about the thing below (which is totally of the top of my head)? It's fairly silly (just iterates over the numbers 0, 1, ..., 99), but it does the iteration in the event loop, and it provides a way of passing "variables" from one iteration to the next. Thus you can write

  set sum 0; set checksum 0
  iterate_0..99 {sum checksum} {i} {
     incr sum $i
     set checksum [expr {(2*$checksum + $i) % 32003}]
  } {
     puts "The sum is $sum, and the checksum is $checksum."
  }

and make it seem as if the "sum" and "checksum" variables persist from iteration to iteration (in reality they're parameters of the function, just like "i" is.

  proc iterate_0..99 {statics items body endbody} {
      set func {}
      lappend func [concat $statics $items]
      set body "::list \[$body\]"
      set values {}
      foreach var $statics {
          append body " \[[list ::set $var]\]"
          lappend values [uplevel 1 [list ::set $var]]
      }
      lappend func $body [uplevel 1 {::namespace current}]
      set endfunc [list $statics $endbody [lindex $func end]]
      after 1 [list helper $func $endfunc 0 $values]
  }

  proc helper {func endfunc count values} {
      if {$count<100} then {
          set res [apply $func {*}$values $count]
          after 1 [list helper $func $endfunc [expr {$count+1}]\
            [lrange $res 1 end]]
      } else {
          apply $endfunc {*}$values
      }
  }

I still think it might be a good idea to have synchronous and asynchronous iterations handled by different subcommands (even if they might internally use much of the same mechanisms), because the differences with respect to what one can and cannot do in them are rather large.

NEM: I don't see what advantage your scheme has. It doesn't cope with early returns or exceptions -- such as break and continue, which are a fairly essential part of the iteration interface. I don't agree that having separate synchronous and async commands is a good thing. With some form of closures, the differences between them become fairly small.

Lars H: Hey, I didn't try to solve all problems in one go, just the one you had with environment variables possibly overwriting arguments. How to add support for break and continue should be obvious to someone of your experience (although now that I think about it, it's not entirely trivial, as the functionification of the body will convert them to errors before helper gets a chance to catch them. It follows that the actual catch should be inserted into the $func itself, whereas the processing of the return code can be done in helper; essentially the same trick I already use to retrieve the variable values).

NEM: Sorry, I didn't mean to be rude. Yes, I can see how to deal with early/exceptional returns. My point is really that it is much easier to do a good job of this in the guts of apply. However, I talked with MS on the chat, and he has some interesting ideas for allowing commands in general (procs and aliases) to have associated state. I'm not sure yet, but I think something like a dict apply could be built on top of that.

NEM 23-02-2011: Years later an intriguing alternative has arrived in the form of coroutines. See generator for a modern take.

Category Concept