- Using iterators/cursors, which are stateful pointers to a 'current' element in a collection;
- Using higher-order enumerator functions, such as foreach, map, filter etc.
The enumerator-function approach has a number of advantages:

- Firstly, the actual traversal code is written by the author of the collection, and so can make use of internal knowledge of the collection for efficiency, without exposing any of these details to clients of the collection;
- Related to the above point, the enumeration code completely encapsulates the traversal, and so can take care of allocating and releasing any needed resources properly, even in the presence of exceptions;
- Writing an enumeration function is generally easier than writing a stateful iterator (e.g., consider a traversal for a complex tree structure such as an AVL tree);
- As iterators are usually stateful, they introduce an implicit dependency among all expressions that use the iterator, whereas an enumeration function hides its state from clients;
- Finally, iterators need some out-of-band method of indicating that the end of the collection has been reached (e.g. an isEmpty method, a special collection-empty return code, or an exception), whereas an enumeration function simply returns when it reaches the end of the collection.
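To make the contrast concrete, here is a minimal sketch of the two styles for a hypothetical mycollection command (all of the subcommands shown are invented for illustration):

```tcl
# Iterator/cursor style: the client drives the traversal, and must
# test for the end and release the cursor itself.
set cur [mycollection cursor]     ;# hypothetical cursor constructor
while {![$cur atEnd]} {
    puts [$cur next]
}
$cur destroy                      ;# easy to forget, especially on error

# Enumerator style: the collection drives the traversal;
# end-of-collection detection and cleanup are hidden inside it.
mycollection foreach item {
    puts $item
}
```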
The proposed interface is:

```
$collection iterate ?options...? proc seed
```

where 'proc' is a procedure to be called for each element of the collection, and the 'seed' argument is an initial value for a parameter of that proc. The 'proc' argument should name a procedure of the form:
```tcl
proc myproc {seed item} { ...; return $newseed }
```

An alternative syntax is to allow the parameters and body of the procedure to be given as individual arguments, which are converted into an anonymous procedure and invoked with apply:
```tcl
$collection iterate {seed item} { incr seed $item } 0
```

This second syntax form is optional (typically it won't be supported when operating in Tcl 8.4, due to the lack of apply). The syntax of the interface resembles that of object-oriented extensions to Tcl such as incr Tcl or XOTcl. However, it should also be possible to implement the interface by other means. For instance, many extensions written in C provide a similar interface (e.g. tdom), but even for pure-Tcl extensions it is not too difficult to author. To make it easier to support pure-Tcl extensions, clients of the interface should assume that $collection is a command prefix (i.e. a list) rather than a single command, and should expand it before calling:
```tcl
set result [uplevel #0 $collection iterate ...]

# Or (safer):
proc invoke {cmd args} { uplevel #0 $cmd $args }
set result [invoke $collection iterate ...]
```

In this way, normal ensemble commands can be supported as collection interfaces with little extra machinery. As with other callback arguments, the collection command is always invoked in the global scope.

Anyway, let's put syntax to one side and look now at the proposed semantics. In its basic form, this operation works just like a left-fold. The Haskell definition of foldl is:
```haskell
foldl proc seed []     = seed
foldl proc seed (x:xs) = foldl proc (proc seed x) xs
-- e.g. foldl (+) 0 [1,2,3] --> (((0+1)+2)+3) = 6
```

In Tcl, that looks like:
```tcl
proc foldl {proc seed list} {
    if {[llength $list] == 0} {
        return $seed
    } else {
        foldl $proc [invoke $proc $seed [head $list]] [tail $list]
    }
}
proc head list { lindex $list 0 }
proc tail list { lrange $list 1 end }
```

It runs through each element of the collection from left to right (with suitable definitions of 'left' and 'right' for the collection type) and calls the 'proc' procedure with the current item and the seed value. The procedure then performs any calculations and returns a new seed value for the next iteration -- in this way, the seed value becomes accumulated state (e.g. a running total) that is built up as we traverse the collection, without relying on external variables. This fold operation is extremely general and powerful. For more information on folds see this tutorial: [3].

In addition to returning a new seed value, the iteration procedure can also return a Tcl exception code to indicate early termination of the traversal. These exception codes have the usual semantics that they have inside loop bodies:
- TCL_OK: normal, proceed with the next iteration;
- TCL_CONTINUE: skip to the next iteration with the same seed value as last time;
- TCL_BREAK: terminate the iteration and return the current seed value;
- other: abort the iteration and propagate the exception.
As a concrete example, here is an implementation of the interface that iterates over the lines of a text file:

```tcl
proc invoke {cmd args} { uplevel #0 $cmd $args }

proc exceptiontype {code} {
    set token [lindex {ok error return break continue} $code]
    return [expr {$token eq "" ? $code : $token}]
}

proc file_iterate {file proc seed} {
    set fd [open $file]
    foreach line [split [read $fd] \n] {
        set rc [catch {invoke $proc $seed $line} result]
        switch -exact [exceptiontype $rc] {
            ok       { set seed $result }
            break    { break }
            continue { continue }
            default  {
                # Cleanup and propagate (needs improving)
                close $fd
                return -code $rc -errorinfo $::errorInfo $result
            }
        }
    }
    close $fd
    return $seed
}

proc file: {file method args} {
    if {$method eq "iterate"} {
        uplevel 1 [linsert $args 0 file_iterate $file]
    } else {
        uplevel 1 [linsert $args 0 file $method $file]
    }
}
```

Having this interface in place allows us to write code like the following:
```tcl
proc def {name = args} {
    uplevel 1 [linsert $args 0 interp alias {} $name {}]
}

# Demo
def myfile = file: test.txt
proc count {total _} { incr total }
puts "[myfile nativename] contains [myfile iterate count 0] lines"

proc longest {longest line} {
    if {[string length $line] > [string length $longest]} {
        return $line
    } else {
        return $longest
    }
}
puts "The longest line is [myfile iterate longest {}]"
```

Some benefits of this code become apparent. Firstly, we don't have to worry about opening and closing the file, as that is taken care of for us. We also don't have to worry about cleaning up if an exception occurs in the middle of processing, as that is handled for us by the file_iterate implementation. At present this implementation isn't particularly impressive in terms of performance, although it is adequate. But imagine how fast it would be if file_iterate were coded in C. We could do all sorts of tricks, like memory-mapping the file, using optimized code to traverse the lines, avoiding line-ending conversions, and calling back into a byte-compiled Tcl proc only as needed. Could we then approach the speed of wc -l in Counting a million lines? Could be worth a try.

Defining generic operations

Once we have our very general iterate procedure, we can then define other operations in terms of it. For instance, here are the classic map and filter, generalised to arbitrary collections:
```tcl
proc map {proc collection} {
    invoke $collection iterate [list map-helper $proc] [list]
}
proc map-helper {proc accum item} {
    lappend accum [invoke $proc $item]
}

proc filter {proc collection} {
    invoke $collection iterate [list filter-helper $proc] [list]
}
proc filter-helper {proc accum item} {
    if {[invoke $proc $item]} { lappend accum $item }
    return $accum
}
```

We can test these on our file iteration example:
puts "The sizes of each line in [myfile nativename] are:" puts [join [map {string length} myfile] \n] proc linebigger {size line} { expr {[string length $line]>$size} } puts "The lines bigger than 80 chars:" puts [join [filter {linebigger 80} myfile] \n]As you can see, nothing in the definition of map or filter depends on any knowledge of the implementation of file_iterate, or even on the type of the underlying collection: we could write 'iterate' methods for lists, dictionaries, database queries and others and the generic definitions of map and filter will work with all of them without alteration.OptionsThe proposed interface also allows for some options to be given to the enumerate method. The full list of options and values accepted could vary from collection to collection, but an initial list of standard options that might be supported is presented here:
Options

The proposed interface also allows for some options to be given to the iterate method. The full list of options and values accepted could vary from collection to collection, but an initial list of standard options that might be supported is presented here:

- -direction left|right: Specifies a traversal direction, allowing an optional right-fold to be implemented as well. This could be extended to allow specifying particular traversal orderings, such as pre-, post-, or in-order traversal of graphs/trees.
- -endcommand proc: Specifies a callback to call with the final seed value when the collection has been traversed.
- -async 0|1: Enables asynchronous operation using the event loop. The iterate method returns immediately and carries on processing in the background, using after or some other means (e.g. fileevents) to schedule calls to the iteration procedure. It could return some sort of token that can be used to cancel the traversal. (A sketch of how this might be implemented appears after the lists below.)
- -progresscommand proc: A separate progress command that is called at regular intervals to indicate how far through the collection the traversal is. This could be used to update a progress bar during a long traversal, for instance. It is called with a single amount-done argument whose interpretation depends on the underlying collection type: e.g., it could be bytes transferred or a list element index. The callback is responsible for determining the total expected amount and how much is left to do.
- -errorcommand proc: Callback invoked when the iteration command returns an error (the default is to simply propagate the error). This would allow flexible handling of e.g. parse errors during a traversal. The arguments passed are the errorCode, error message, and errorInfo of the error. If the callback returns a normal result then iteration continues, as if continue had been returned; otherwise, the exception returned by the callback is propagated. The callback is responsible for any cleanup, if needed. This allows the callback to log and ignore some errors, while propagating others. (An illustrative invocation follows this list.)
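These options are speculative: none of the implementations on this page support them. Purely for illustration, and assuming $collection holds a suitable command prefix, an invocation combining several of them might look like:

```tcl
# Hypothetical: sum a collection in the background, reporting the
# result (or any error) via callbacks once the traversal finishes.
proc sum  {seed item} { expr {$seed + $item} }
proc done {seed} { puts "total = $seed" }
proc oops {code msg info} { puts "iteration failed: $msg" }

set token [invoke $collection iterate \
    -async 1 -endcommand done -errorcommand oops sum 0]
```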
When these options are in use, the interpretation of the return code of the iteration procedure is extended as follows:

- TCL_OK: the result is the new seed, proceed with the next iteration;
- TCL_CONTINUE: proceed with the next iteration, keeping the current seed;
- TCL_BREAK: abort the iteration; -endcommand is called with the current seed;
- other: call -errorcommand with $::errorCode, the result message, and $::errorInfo. If the exception type is not TCL_ERROR, then these values are filled in with a fabricated error, equivalent to [error "unexpected exception: $code" "" [list ITERATE UNEXPECTED $code]].
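As an illustration of how -async, -endcommand and -errorcommand might fit together, here is a hedged sketch of an event-loop driven variant of the file iterator above (file_iterate_async and file_iterate_line are names invented for this sketch), reusing the invoke and exceptiontype helpers defined earlier:

```tcl
proc file_iterate_async {file proc seed endcmd errcmd} {
    set fd [open $file]
    after idle [list file_iterate_line $fd $proc $seed $endcmd $errcmd]
    return $fd   ;# a token; closing it could serve to cancel the traversal
}
proc file_iterate_line {fd proc seed endcmd errcmd} {
    if {[gets $fd line] < 0 && [eof $fd]} {
        close $fd
        invoke $endcmd $seed          ;# traversal complete
        return
    }
    set rc [catch {invoke $proc $seed $line} result]
    switch -exact [exceptiontype $rc] {
        ok       { set seed $result }
        continue { }
        break    {
            close $fd
            invoke $endcmd $seed
            return
        }
        default  {
            # Hand the error to -errorcommand; a normal return from the
            # callback means "keep iterating", anything else stops here.
            if {[catch {invoke $errcmd $::errorCode $result $::errorInfo}]} {
                close $fd
                return
            }
        }
    }
    after idle [list file_iterate_line $fd $proc $seed $endcmd $errcmd]
}
```

Each step schedules the next one with after idle, so the traversal interleaves with other event-loop work -- which is also what a C implementation could do.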
Finally, the interface is powerful enough to support converting an enumeration into a lazy stream. First, here is a small package implementing lazy streams:

```tcl
namespace eval stream {
    proc delay args { return $args }
    proc force item { uplevel #0 $item }
    proc nil {} { delay error "empty list" }
    proc cons {head tail} { delay list $head $tail }
    proc head stream { lindex [force $stream] 0 }
    proc tail stream { force [lindex [force $stream] 1] }
    proc take {n stream} {
        if {$n == 0} { return [list] }
        linsert [take [incr n -1] [tail $stream]] 0 [head $stream]
    }
    namespace export {[a-z]*}
    namespace ensemble create ;# 8.5ism!
}
```

Here, streams are represented as pairs of unevaluated expressions: the first element, when evaluated, generates a value, and the second element generates a new stream (the 'tail' of the stream). This is basically a lazy version of Lisp's cons cells. Now we define our enumeration method for lists. However, as we are adding an extra 'self' parameter, we will call this method "superfold" rather than iterate.
```tcl
proc lsuperfold {self list proc seed} {
    if {[llength $list] == 0} {
        return $seed
    } else {
        set tail [lassign $list head]
        invoke $self $tail $proc [invoke $proc $seed $head]
    }
}
proc list: {list method args} {
    uplevel 1 [linsert $args 0 l$method $list]
}
```

Now, from this we can derive our original iterate method without the extra 'self' parameter by simply passing it to itself (i.e., we find its fixed point, as you would using the Y combinator: Hot Curry):
```tcl
proc literate {list proc seed} {
    lsuperfold literate $list $proc $seed
}
```

And, just to demonstrate that this conforms to our original definition, we can use the generic map and filter functions on it as usual:
```tcl
def l = list: {1 2 3 4 5 6 7 8 9 10}
proc + {a b} { expr {$a + $b} }
puts "add1 to all = [map {+ 1} l]"
```

But adding the explicit 'self' parameter also allows us to interrupt a traversal in the middle and capture its state: essentially we have made the continuation explicit, as in continuation-passing style (CPS). Here, we will capture the enumeration continuation and store it in a stream:
```tcl
proc fold2stream {fold collection} {
    invoke $fold [list streamnext $fold] $collection snd [stream nil]
}
proc snd {a b} { return $b }
proc streamnext {fold tail proc seed} {
    stream cons $seed [stream delay fold2stream $fold $tail]
}
```

We can now use this to convert our list into a stream:
```tcl
def lstream = fold2stream lsuperfold
set l [list 1 2 3 4 5 6 7 8 9 10]
set s [lstream $l]
puts "First 4 = [stream take 4 $s]"
```
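If all of the pieces above are wired together correctly, this should print "First 4 = 1 2 3 4": each force of the stream's tail resumes the suspended fold for one more step, and take forces only the first four elements.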
Discussion

NEM: The original article about the Scheme version generalizes the iterate method further by allowing it to take and return multiple seeds, i.e.:
```
$collection iterate $proc seed seed...
```

I initially implemented Tcl versions that way, but in practice I found it clumsy to use for the common case where you only have one seed. For instance, having to wrap return values with return [list $value] was too easy to overlook. One seed should be sufficient, as you can always pack arbitrary data structures into it. The only advantage of the multi-seed version is that it takes care of some of the unpacking and checking of these values for you (as they are just normal proc arguments).
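For example, a hypothetical two-seed callback computing a sum and a count in a single pass would look something like the sketch below; note how both seeds must be re-packed into a list on every return:

```tcl
# Hypothetical multi-seed callback (not part of the proposal above):
proc sum+count {sum count item} {
    # forgetting the [list ...] wrapper here is an easy mistake
    return [list [expr {$sum + $item}] [expr {$count + 1}]]
}
# $collection iterate sum+count 0 0  ;# e.g. {55 10} for the list 1..10
```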
Any comments are welcome. I'm a big fan of using folds in this way, as I think it works well with scripting by moving loops into C code (assuming collections implemented in C). The only real problem is that of looping over two collections at once. The automatic conversion works well in Scheme (where call/cc handles most of the details), but in Tcl it could do with some work. If anyone has any good ideas about how to improve it, please share.

At the moment, this proposal is just a suggestion. When it has matured a bit, I wonder if it might be worth writing an informational TIP or similar to propose some standard interfaces for collections? Something similar to the Scheme SRFI referenced above.

Lars H: What about a more foreach-like enumerator, i.e., one where the code part gets evaluated in the same stack context the enumerator was called from? (TclX's scanfile does it that way, and my practical experience of that is that it is pretty useful to have access to various local variables.) It's clear to me how to build a command-calling enumerator from a foreach-like one, but the other way round is not clear (you have to know how much to uplevel, which I didn't see specified above). The foreach-like approach also gives a trivial solution to how to pass state from one iteration to the next (the state never goes away), but one might of course argue that that is inherently bad, as it is not functional.

The multiplicity of options makes me a bit worried (as does the fact that those options come in the middle of the command rather than at the end). It could be better to split some of them off as separate subcommands (e.g. -async on).

If you're interested in specifying interfaces, you might want to take a look at [5]. It would give you a way for collections to tell their users "Yes, I support this kind of enumerator." You might also find the concept of autoadaptors interesting.

NEM: The problem with a foreach-like approach is that it limits implementation options to some extent. For instance, the -async option would be impossible to implement in Tcl if you had to guarantee that the original stack context would still be around. I'm just in the process of adapting a higher-level channel API to conform to this interface, and async operation is very important there (I haven't even implemented synchronous operation yet). I'm also rather fond of functional solutions where possible. I agree, though, that this makes using state harder than a foreach-like approach does. Perhaps a compromise can be reached.

Regarding use of variables: I agree, it is very useful to be able to refer to surrounding variables without some format/string map or other quoting. I'm toying with adding in some of the code from simple closures and objects, which would allow capturing the values of some of these variables when calling the enumerator. Something like:

```tcl
set out [socket ...]
$col iterate {seed item} { puts $out $item } ""
```

(The seed should probably be optional for these cases.) Here the $out variable is read-only, but perhaps that is enough for most uses? One thing I've been thinking of recently is a new option to apply that makes it work a bit like dict with:
```tcl
dict set env out [socket ...]
...
apply -environment env $func $args...
```

Here, apply creates a new scope, initializes it with the variable/value mappings in the environment dictionary, and then binds the arguments. When the scope is exited, it does as dict with does, and alters the env variable to contain the new values. This would allow you to use mutable closures in loops, while the lambda itself remains a pure value (i.e., a string). This currently cannot be done entirely correctly with what we have, as:
```tcl
set body [list dict with __env__ $body]
set func [list [linsert $params 0 __env__] $body $ns]
apply $func $env ...
```

doesn't quite work correctly -- variables from the environment will override parameters!

With this in place, you'd essentially be able to get a foreach-like interface, but using apply (and thus byte-compiled), with the context captured so that you can avoid being tied to synchronous evaluation in the caller's scope. Another question would then be whether to capture the whole scope (i.e. all of [uplevel 1 { info locals }]), or to have a statics list, as in Jim, to limit what is captured. See the bottom of simple closures and objects for more information and a toy implementation.

You're right about the options -- there are too many. A minimal set would be just -endcommand and -errorcommand. Presence of -endcommand implies -async, and -progresscommand can be implemented as part of the iteration callback by client code (which also knows more about what progress has been made).

Lars H: dict with might be overkill in this case. What about the thing below (which is totally off the top of my head)? It's fairly silly (it just iterates over the numbers 0, 1, ..., 99), but it does the iteration in the event loop, and it provides a way of passing "variables" from one iteration to the next. Thus you can write
```tcl
set sum 0; set checksum 0
iterate_0..99 {sum checksum} {i} {
    incr sum $i
    set checksum [expr {(2*$checksum + $i) % 32003}]
} {
    puts "The sum is $sum, and the checksum is $checksum."
}
```

and make it seem as if the "sum" and "checksum" variables persist from iteration to iteration (in reality they're parameters of the function, just like "i" is). The implementation:
```tcl
proc iterate_0..99 {statics items body endbody} {
    set func {}
    lappend func [concat $statics $items]
    set body "::list \[$body\]"
    set values {}
    foreach var $statics {
        append body " \[[list ::set $var]\]"
        lappend values [uplevel 1 [list ::set $var]]
    }
    lappend func $body [uplevel 1 {::namespace current}]
    set endfunc [list $statics $endbody [lindex $func end]]
    after 1 [list helper $func $endfunc 0 $values]
}
proc helper {func endfunc count values} {
    if {$count < 100} then {
        set res [apply $func {*}$values $count]
        after 1 [list helper $func $endfunc [expr {$count+1}] \
                     [lrange $res 1 end]]
    } else {
        apply $endfunc {*}$values
    }
}
```

I still think it might be a good idea to have synchronous and asynchronous iterations handled by different subcommands (even if they might internally use much of the same mechanisms), because the differences with respect to what one can and cannot do in them are rather large.

NEM: I don't see what advantage your scheme has. It doesn't cope with early returns or exceptions -- such as break and continue, which are a fairly essential part of the iteration interface. I also don't agree that having separate synchronous and async commands is a good thing. With some form of closures, the differences between them become fairly small.

Lars H: Hey, I didn't try to solve all the problems in one go, just the one you had with environment variables possibly overwriting arguments. How to add support for break and continue should be obvious to someone of your experience (although now that I think about it, it's not entirely trivial, as the functionification of the body will convert them to errors before helper gets a chance to catch them. It follows that the actual catch should be inserted into the $func itself, whereas the processing of the return code can be done in helper; essentially the same trick I already use to retrieve the variable values).

NEM: Sorry, I didn't mean to be rude. Yes, I can see how to deal with early/exceptional returns. My point is really that it is much easier to do a good job of this in the guts of apply. However, I talked with MS on the chat, and he has some interesting ideas for allowing commands in general (procs and aliases) to have associated state. I'm not sure yet, but I think something like a dict apply could be built on top of that.

NEM 23-02-2011: Years later an intriguing alternative has arrived in the form of coroutines. See generator for a modern take.
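As a flavour of that approach, here is a minimal coroutine-based generator sketch (Tcl 8.6 or later; the lgen and next names are invented here rather than taken from the generator page):

```tcl
proc lgen {list} {
    yield                   ;# pause until the first [next] call
    foreach item $list { yield $item }
    return -code break      ;# ends the caller's loop when exhausted
}
coroutine next lgen {1 2 3 4 5}
while 1 { puts [next] }     ;# prints 1 through 5, then the break ends the loop
```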