Updated 2016-02-23 21:43:06 by escargo

Philip Quaife 8 May 2004.

In musing over A Case for Metaprogamming and Dictionaries as arrays, I have had some new thoughts on a concept that advances the unification of arrays, lists, and other data structures by way of the $ notation.

Introduction

We access variables using the set command. We also use $ as an alias for set (but with slightly different parsing rules for the variable name).

When accessing parts of a variable's value, we resort to functions specific for the type of data that the variable contains.
e.g. lindex for lists, string range for strings.

Each data type has a plethora of associated functions for manipulating a datum.

The following is a conceptual change to the handling of a [tclObjType] that allows a more uniform accessor method.

The accessor model is usually found in OO languages where all object properties are hidden and can only be read/written through object methods. Usually all objects support the same getter and setter method names for consistency.

In TCL we only expose a "string" as our public property. This allows us to use other representations for optimizations when strings are not the most appropriate format without affecting code that uses our object types. We see this in mathematical expressions for example.

Take the following:
   interp alias {} get {} set
   set x Hello
   set x -> Hello
   get x -> Hello

We see that the semantics of get/set are not dissimilar. While in TCL we use the duality of set for both operations, consider for the moment that we use the get/set model. We also implement this by way of extending the internal tclObjType handler to have pointers for getter/setter functions.

Now take the following:
   command $var
   command $var(key)

These can be interpreted as:
   command [get var] is the same as [set var]
   command [get var key]

The meaning of:
    $var(key)

currently is: get the variable from the hash var that has the key key and return its tclObj datum.

The new meaning would be: call the tclObjType accessor function for the variable var, requesting the part key.

When a variable has a type of assocArray (new obj type) then the key would refer to an entry in an attached associative array whose value is a tcl variable structure.

Note there is no inherent meaning of the interpretation of key. Each tclObjType will use key as appropriate.

Some subsequent examples will clarify this.

Now we see why we cannot use the duality of set for both read and write. We now have extended the get proc to have a second argument that we will apply the notation of:
   part

We think of the value within the parentheses as a request for part of the whole. Such as with an array, we request one of the elements not all of them.

By definition, when there are no parentheses we are requesting the whole of the datum.

The raison d'etre for this change in interpretation of the notation: $var(index), is to allow a common syntax regardless of internal representation, as well as extensibility.

Rather than limiting the notation to arrays we are free to apply the notation to any tclObjType.

Note we will also have to change the implementation of set to call the setter method of the tclObjType for the variable when given a variable notation of: var(key). In all likelihood we would change set to set var part value notation.
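
The get/set model described above can be approximated at the script level today. What follows is only a sketch under stated assumptions: the names accessor::register, accessor::declare, get and put are invented for this illustration, and the type of each variable is declared by hand because a script cannot see a value's tclObjType. The real proposal would dispatch on the internal type inside the core instead. It needs Tcl 8.5 or later.

```tcl
# Hypothetical script-level model of the proposed getter/setter dispatch.
# All names here are illustrative; none of this exists in the Tcl core.
namespace eval accessor {
    variable types   {}   ;# variable name -> declared type tag
    variable getters {}   ;# type tag -> getter command prefix
    variable setters {}   ;# type tag -> setter command prefix
}

proc accessor::register {type getter setter} {
    variable getters
    variable setters
    dict set getters $type $getter
    dict set setters $type $setter
}

# Stand-in for the internal tclObjType: the type is declared explicitly.
proc accessor::declare {varName type} {
    variable types
    dict set types $varName $type
}

# get varName ?part?  (no part means the whole datum)
proc get {varName args} {
    upvar 1 $varName value
    if {[llength $args] == 0} {
        return $value
    }
    set type [dict get $::accessor::types $varName]
    return [{*}[dict get $::accessor::getters $type] $value [lindex $args 0]]
}

# put varName part newValue  (models "set var part value")
proc put {varName part newValue} {
    upvar 1 $varName value
    set type [dict get $::accessor::types $varName]
    set value [{*}[dict get $::accessor::setters $type] $value $part $newValue]
    return $newValue
}

# Each type decides for itself what "part" means.
accessor::register list \
    {apply {{value part} {lindex $value $part}}} \
    {apply {{value part new} {lset value $part $new}}}
accessor::register dict \
    {apply {{value part} {dict get $value {*}$part}}} \
    {apply {{value part new} {dict set value {*}$part $new}}}

set alist [list A [list B C D] E]
accessor::declare alist list
puts [get alist 0]        ;# A
puts [get alist {1 2}]    ;# C
put alist {1 2} X
puts $alist               ;# A {B C X} E

set fred [dict create A 1 New [dict create A 100 B 10]]
accessor::declare fred dict
puts [get fred {New A}]   ;# 100
```

The type table is keyed by bare variable name, so this toy ignores scoping; it is meant only to show that the interpretation of the part lives entirely in the per-type getter and setter.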

Benefits

  1. Unified syntax.
  2. Extensibility.
  3. Less obfuscation of the meaning of the statements.

Contradictions

  1. The main problem in the implementation of this construct is that associative arrays are a kludge that has never been reworked when objectifying tcl variables.
  2. An associative array is an abstract quantity, so how do you call setter/getter functions when you don't know the type of data associated with the variable until you have accessed it?
  3. Performance. While a reference to a scalar will degenerate to a call to set, which will be byte coded out of existence, and an array reference will turn into a cached hash lookup, commands such as lindex and string range will not be able to be bytecoded when called as accessors through the $ notation (due to run time type determination).

Pause and Think

Nothing stated above requires any change in syntax or causes any incompatibility with the current handling of associative arrays. The above is not an attempt to replace list handling functions or string handling functions.

Nor is the above a switch in methodologies to an Object-based notation. In fact the use of get/set is consistent with the current use of a common function regardless of type.

What it does, is, for those people that want it, provide a more concise way of slicing up a variable's data.

For associative arrays, there is no change conceptually to how these retrieve and store data. There is a huge change in the core code that implements arrays and most likely any code that makes reference to associative arrays. One hopes that the advent of the dict tclObjType for Tcl 8.5 and associated changes to arrays will make this change a practical proposition.

Changes in the core

The main change would be to add two more function pointers to the tclObjType structure. One for setting and one for getting.

The set method would only be called when requesting setting part of the data (as in one element of an associative array).

Likewise for the get function, as an object's string rep and the object data itself are available to the caller.

Examples

Now let us extend the accessor function to other tclObjTypes.

Try A String:
   set astring {Hello!}
   puts $astring(0) -> H
   puts $astring(end) -> !
   puts $astring(1 3) -> ell

Try lists:
   set alist {A {B C D} E} ;# ms pointed out this is not a list yet
   set alist [list A [list B C D] E] ;# this is a list

   puts $alist(0) -> A
   puts $alist(end) -> E
   puts $alist(1 2) -> C

MS notes that this is not what would happen; until some list op is done to $alist, it has a string tclObjType; so that [puts $alist(1 2)] -> " \{" and not "C". Had there been an intervening [llength $alist] the story would change. PWQ: amended above, thanks.

Try Dicts:
   set fred [dict create A 1 B 2 C 3 D 4] ;# or whatever syntax is implemented.
   $fred add New [dict create A 100 B 10]
   puts $fred(A) -> 1
   puts $fred(end) -> {}
   puts $fred(1)  -> {}
   puts $fred(New A) -> 100
   puts "I think that [get $fred {New A}] is the same as $... above"

Try Keyed lists as structures, or binary data represented as named quantities (As in ASN.1 notation for example).
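
For comparison, here is how the string, list and dict examples above are written with the type-specific commands of current Tcl (8.5 or later for dict). This is just a side-by-side reference, not part of the proposal; dict set stands in for the hypothetical "$fred add" line, and dict exists is used for the keys that the examples show as returning {}.

```tcl
# Strings: the accessor is chosen by the command name.
set astring {Hello!}
puts [string index $astring 0]      ;# H
puts [string index $astring end]    ;# !
puts [string range $astring 1 3]    ;# ell

# Lists: lindex accepts several indices for nested access.
set alist [list A [list B C D] E]
puts [lindex $alist 0]              ;# A
puts [lindex $alist end]            ;# E
puts [lindex $alist 1 2]            ;# C

# Dicts: keys are arbitrary strings, so "end" is just a missing key.
set fred [dict create A 1 B 2 C 3 D 4]
dict set fred New [dict create A 100 B 10]
puts [dict get $fred A]             ;# 1
puts [dict exists $fred end]        ;# 0
puts [dict get $fred New A]         ;# 100
```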

Pause and Think

You do not have to use it.

What the above does, is give the programmer the ability to determine what meaning the $ notation has.

If you want you can always use [set varname] to access a variable rather than $ notation.

What's happening behind the variable

Take for example accessors for a list type. The notation $var(end) could be interpreted as:
   lindex $var end

While the notation $var(1 .. 3) could be interpreted as:
   lrange $var 1 3

This could be a call through the scripted Tcl interface, or it could be a direct call to the lower level C API function, or it could be implemented inside the accessor function.

Which is the best approach probably depends on the underlying data type. Maybe this model is best reserved for creating new Abstract Data Types and the core types, such as dicts, lists, arrays, do not implement any accessor functions.

Note: In the above, could means just that. The meaning of the notation is entirely determined by the getter function. Any defaults as applied to core types such as list would be dictated by the TCT. It could even raise an error. The programmer can however override this default interpretation if desired. The above is an example of one such interpretation.
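
One possible list getter along these lines can be sketched as a scripted proc. The name listGetter and the ".." range convention are assumptions made for this illustration only; as noted, a real getter would be free to choose any other interpretation of the part.

```tcl
# Hypothetical getter for list values: receives the value and the text
# that appeared between the parentheses, and decides what it means.
proc listGetter {value part} {
    # "first .. last" is treated as a range request
    if {[llength $part] == 3 && [lindex $part 1] eq ".."} {
        return [lrange $value [lindex $part 0] [lindex $part 2]]
    }
    # anything else is passed to lindex as an index (or index path)
    return [lindex $value $part]
}

set var [list a b c d e]
puts [listGetter $var end]         ;# e
puts [listGetter $var {1 .. 3}]    ;# b c d
```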

Var vs $var

Currently set takes varname rather than tclObj as the reference to the variable to be set. It is not yet determined if the new Tcl commands set/get should/need to take varname. The principal requirement would be that traces can be determined when a variable is accessed. Ideally, the use of the tclObj directly would be a more orthogonal one.

Exposing accessors to the script level

The most benefit to having accessors would be realized if the functionality is available as scripted procs.

This would allow programmers to change the meaning of the key inside the parentheses to create new constructs that match the application processing of the data.

Coupling this with namespaces, which allow private getter/setter functions, gives a controlled and structured replacement strategy. I.e. this does not need to affect the default accessor behaviour, much like namespaces allow the overriding of core Tcl commands with safety.
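
A rough sketch of that namespace idea, using ordinary command resolution: the dispatcher looks up a getter by name from the caller's namespace, so a namespace can supply its own without touching the global default. The names getPart, partGetter and matrix are hypothetical and chosen only for this example.

```tcl
# Default interpretation, visible everywhere: part is a lindex path.
proc ::partGetter {value part} {
    return [lindex $value $part]
}

# Dispatcher: resolve "partGetter" as the caller would, so that a
# namespace-private getter takes precedence over the global one.
proc ::getPart {value part} {
    set cmd [uplevel 1 {namespace which -command partGetter}]
    return [$cmd $value $part]
}

namespace eval matrix {
    # Private override: part is a 1-based {row column} pair.
    proc partGetter {value part} {
        lassign $part row col
        return [lindex $value [expr {$row - 1}] [expr {$col - 1}]]
    }
    proc cell {m row col} {
        return [getPart $m [list $row $col]]
    }
}

set m {{1 2 3} {4 5 6}}
puts [getPart $m {1 2}]       ;# 6  (global default, 0-based)
puts [matrix::cell $m 1 2]    ;# 2  (matrix override, 1-based)
```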

But we don't need to do it

We also do not need a virtual file system in TCL. I can do the same thing under Linux. I can mount an ftp server as a directory and any program can access files as though they were local to the machine.

However there are times when the VFS facility is of use, such as in starkits.

Likewise, the ability to implement accessor functions can be of benefit when the programmer requires them.

The key phrase would be:
Is there any reason to limit functionality?

You only know they are needed when the job requirements call for them and you do not have them.

NEM This is quite interesting, and partially related to some stuff I have been thinking about recently, with a view to eventually working towards a TIP. See Feather and particularly read up on the interfaces stuff, as this is very related. The getter/setter methods to Tcl_ObjType you propose above would be instead in an interface. Paul came up with a generic container interface (or something like that), which was similar. The mechanism is more generalized though. One thing which needs to be cleared up in the above is the difference between values and variables. [set] works with variables, and we have the following scheme currently:
 set var "a"  ;# Store the value "a" in the variable "var"
 set var       ;# Retrieve the value stored in the variable "var"
 set foo(a) "bar" ;# Store the value "bar" in the variable with key "a" which is part of the array variable "foo"
 set foo(a)   ;# etc

The point is, that the $a(b) syntax (and the [set] equivalent) currently means finding a value which is stored in a variable which is stored inside an array variable. AIUI, your proposed change is to just have normal variables (and drop arrays), so that:
 puts $foo(a)

would instead retrieve the value held in the variable foo, and then locate a sub-part in that which corresponds to the key "a". The difference is subtle, but involves one less dereference than currently:

Current:

  • find variable called "foo"
  • get array "value" from that
  • find variable called "a" in array
  • get value from that

Proposed:

  • find variable called "foo"
  • get container value from that
  • get part "a"

In the new scheme, when we have found what "a" refers to, we just return it, as it is a value. Currently, what "a" corresponds to is a variable so we have to dereference it again to fetch the actual value (and trigger traces on it etc). This makes reusing the array syntax problematic, as we cannot deal with arrays and container values in the same way.

The above description of arrays is probably not how they actually work (I haven't checked), but hopefully it is conceptually correct. The array "value" is something which is never actually visible at the Tcl level, but is instead manipulated through operations on the variable that contains it (it is opaque, and the variable name is the handle). Tcl treats arrays specially in this regard.
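
The variable/value difference can be observed from a script in current Tcl: an array element is a variable in its own right and can carry its own trace, whereas a key inside a dict value is not a variable at all. A small demonstration, with variable names picked for the example:

```tcl
# An array element is a variable: it can carry its own read trace.
set ::reads 0
set foo(a) "bar"
trace add variable foo(a) read {apply {{name1 name2 op} {incr ::reads}}}
set x $foo(a)          ;# $ substitution fires the trace
set y [set foo(a)]     ;# so does the [set] form
puts $::reads          ;# 2

# A dict is a value stored in one scalar variable; the key "a" names
# a part of that value, not a variable.
set d [dict create a bar]
puts [dict get $d a]       ;# bar
puts [info exists d(a)]    ;# 0
```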

PWQ: The main benefit of feather seems to be that it allows multiple representations of a tclObj, which helps prevent shimmering. It's a shame that there is not more documentation on the point of the feather extension. I have compiled it and run through the examples (such as tree.tcl) but don't have any feel for the actual point of it all.

Did you read the Paul Duffin article referenced on Feather? What Paul was trying to do was create an infrastructure that made it much easier to create new data types.

Lars H: The big problem with the above is that it assumes that Tcl values have types! The meaning of
   $V(2)

would be different depending on whether V was set as
   set V {4 3 2 1};            # Type string: [string index] accessor => 3
   set V [list 4 3 2 1] ;      # Type list: [lindex] accessor         => 2
   set V [dict create 4 3 2 1];# Type dict: [dict get] accessor       => 1

whereas today these are all identical (at least for a suitable choice of hash function; there's a 50% chance that [dict create 4 3 2 1] returns "2 1 4 3" instead). That is a very fundamental change to the language, and it would break several programming idioms.
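
In current Tcl the point can be checked directly: one and the same value answers all three questions, because the command chooses the interpretation rather than the value.

```tcl
set V {4 3 2 1}
puts [string index $V 2]    ;# 3  (third character)
puts [lindex $V 2]          ;# 2  (third list element)
puts [dict get $V 2]        ;# 1  (value stored under key 2)

# However V was built, it is the same value as far as a script can tell.
puts [expr {$V eq [list 4 3 2 1]}]    ;# 1
```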

Today one can do
    set F [open "tempfile" w]
    fconfigure $F -encoding utf8 -translation lf
    puts -nonewline $F $value
    close $F

later do
    set F [open "tempfile" r]
    fconfigure $F -encoding utf8 -translation lf
    set value [read $F]
    close $F

and recover exactly the same $value, regardless of how complicated that value is! (It probably won't be stored in the same way, but any script can go ahead and use it as if it had never been written to file.) This is possible because Tcl accessors come with a choice of interpretation for the data they are applied to, but would not be possible with the one-accessor-fits-all strategy proposed above.

I might add that having "types" (technically known as "categories": letter, other, active, begin-group, end-group, etc.) of data internally but losing these when data is written to a file has been a major headache in the history of LaTeX, and with respect to multi-lingual documents still is. Since Tcl already has a fully working solution here, there is no need to break it.

Another problem, which NEM mentioned briefly above, is that the proposer seems to be confused about the distinction between values and variables. set is a device for accessing variables. Tcl_Objs are values. Variables were never obj'ified in any other sense than that they store Tcl_Objs, and that was equally done with array elements as with scalar variables. Integers, floats, and lists on the other hand were obj'ified. Commands can be obj'ified (expect Tcl_Objs as arguments), but need not be (there is a C command for declaring Tcl commands that take strings as arguments).

What probably could be done (but not in any Tcl 8.*, please) is to change the interpretation of variable names, so that one could have a name for a part of the value stored in the variable, and use that in cases where a variable name was expected. Suppose, for example, that a variable name has a proper list structure and the first element of that list is the name of some type of container, e.g. list. Then this would be interpreted as referring to a part of the value stored in the variable whose name is given in the second element of the name, and any remaining elements would be interpreted as specifying what part precisely. Examples, with translations:
   ${list L 2}
     # [lindex $L 2]
   set {list L 2} xyz
     # lset L 2 xyz
   set {list L 2 3}
     # lindex $L 2 3
   set {dict D surname}
     # dict get $D surname
   set {list {list L 2} 3}
     # lindex [lindex $L 2] 3   

It should of course be nestable,
   set {list {dict D addresses} 0} {10 Downing Street, ...}
     # dict set D addresses [lreplace [dict get $D addresses] 0 0 {10 Downing Street, ...}]

that is where it really shines! It can also be handy with commands such as scan that only store data in variables. (Try coding
   binary scan $headerBytes H8IA40A20c {dict header checksum}\
     {dict header designsize} {dict header codingscheme}\
     {dict header family} {dict header face}

without using auxiliary variables in current Tcl.)
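
The structured-name idea can be tried out in current Tcl with a pair of helper procs. This is only an approximation under stated assumptions: pget and pset are hypothetical names, they must be called explicitly (they do not change what set or $ do), only the list and dict container kinds are handled, and the containers must already exist.

```tcl
# pget name: read the part of a variable described by a structured name
# such as {list L 2} or {list {dict D addresses} 0}.
# "level" is internal bookkeeping so that nested calls still reach the
# original caller's variables.
proc pget {name {level 1}} {
    if {[llength $name] < 2} {
        upvar $level $name var
        return $var
    }
    set path [lassign $name kind inner]
    set container [pget $inner [expr {$level + 1}]]
    switch -- $kind {
        list    { return [lindex $container {*}$path] }
        dict    { return [dict get $container {*}$path] }
        default { error "unknown container kind \"$kind\"" }
    }
}

# pset name value: write the part described by a structured name, then
# store the modified container back, recursing outwards.
proc pset {name value {level 1}} {
    if {[llength $name] < 2} {
        upvar $level $name var
        return [set var $value]
    }
    set path [lassign $name kind inner]
    set container [pget $inner [expr {$level + 1}]]
    switch -- $kind {
        list    { lset container {*}$path $value }
        dict    { dict set container {*}$path $value }
        default { error "unknown container kind \"$kind\"" }
    }
    pset $inner $container [expr {$level + 1}]
    return $value
}

set L [list a b c d]
puts [pget {list L 2}]                    ;# c
pset {list L 2} xyz
puts $L                                   ;# a b xyz d

set D [dict create addresses [list {1 Old Road} {2 Old Road}]]
pset {list {dict D addresses} 0} {10 Downing Street}
puts [pget {list {dict D addresses} 0}]   ;# 10 Downing Street
```

Note that a plain variable name containing spaces would be misread as a structured name here; that ambiguity is part of the idea itself rather than of this sketch.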

If anything should be done with respect to the $arr(index) notation, it should rather be to create alternatives than to make it more powerful, because
   set Arr($index) $newval

has a problem with respect to shimmering of $index (it has to first be embedded into a string, and then parsed out of that string). One could easily extend the above scheme so that
   set [list array Arr $index] $newval

provides such an alternative, if desired.

PWQ 11 May 04: Thanks Lars H for those comments. I don't believe I am confused about vars and objects.

Lars H: You do propose "set" and "get" entries in the Tcl_Obj structure, despite "set" being something that modifies a variable.

PWQ: See followup. And objects do have types, it's the first field of the Tcl_Obj structure!

Lars H: That is a private implementation detail, which it is a bug to expose at the script level. The language would be a lot more fragile with public types than it currently is.

The premiss for the above all stems from the fact that $ is supposed to be conceptually the same as set, however that connection appears to have long since been lost.

In closing, TCL programmers don't care how variables or objs are stored or accessed internally, they are only concerned with using TCL commands to perform some processing. The extension of the $ notation is one way of getting that processing done with less typing. When Tcl introduced the $ notation, it did not have extended data types such as dicts and structs (or possibly even arrays), so it's time to bring the $ notation up to speed with the rest of the core.

pwq Followup 12 May 04:

Lars H asked the question, "do set/get operate on TclObj structure or variables?"

The answer to this is that set/get operate on variables. The subtle difference would be as follows:
  Case 1
      set var [get x]

The above is the same as now:
        set var $x

  Case 2
     set var [get x subpart] ;# aka set var $x(subpart)

In the above case, set has to dereference x and then call the accessor function for the object type of x , and then assign the returned Tcl_Obj to var.

In C pseudocode, with error handling and definitions omitted, for the statement:
 set var(0) $x
   SetObjCmd(dummy, interp, objc, objv) {
        /* LookupVar now only needs to deal with scalars */
        varPtr = TclObjLookupVar(interp, part1Ptr, part2, flags,
            /*createPart1*/ 0, /*createPart2*/ 0, objv[1]);
        if (part2 == NULL) {
                /* whole-value assignment, as now */
                varPtr->objPtr = objv[2];
                Tcl_ReturnObj(varPtr->objPtr);
        } else {
                /* setProc/getProc are the proposed new Tcl_ObjType fields */
                varPtr->objPtr->typePtr->setProc(varPtr->objPtr, part2, objv[2]);
                Tcl_ReturnObj(varPtr->objPtr->typePtr->getProc(varPtr->objPtr, part2));
        }
   }

I am not sure what problem Lars H has with the above. All we have actually done is move the call to a hash lookup from within the LookupVar function to the ObjType structure (assuming that we now have an array type of object).

Another consideration: since we do have variables and TclObj, and this creates issues when looking at changes like those mentioned above, I propose instead that variables would be implemented as just another Tcl_ObjType so that we can always access the variable info if needed without having to pass a varname to set/get. Instead we would pass a Tcl_Obj that was a variable type, and set/get would do the necessary double indirect dereference to return the objPtr.

Again I do not see anything controversial, or magical about my proposal.

Lars H, 12 May 2004: You appear familiar with the details of Tcl:the-C-library, but not so familiar with Tcl:the-language. Nothing magic about your proposal? It could probably be implemented, but there are likely to be some ugly corner cases. Nothing controversial about your proposal? Read what I wrote was the big problem with it. It does throw away "everything is a string"! You can't get more controversial than that!