Interning a String

Strick:

In some languages, e.g. Java, you can "intern

" a string, which gives you one unique "interned instance" of the string with that value.

Here's a proc to do it in Tcl:

proc intern s {
       if { ! [info exists ::INTERN($s)] } { set ::INTERN($s) $s }
       set ::INTERN($s)
}

I used Tcl a dozen years before needing this, but today i wrote a program that used many of the same long strings (file pathnames) in some Tcl lists.

There's two different reasons you might want to intern a string:

To make string comparisons as fast as pointer comparisons
To save memory, when strings with the same value are used many times

Most often you intern strings for the first reason, but in Tcl that can't work: there are no separate "value" and "address" comparisons. Today I interned for the second reason.

MS remarks that Tcl has a shared pool of literal strings. Any code within proc bodies reuses literals from the shared pool. In a sense, all literal strings are already "interned", and this mechanism is mainly useful for strings that are generated at runtime.

Lars H: remarks that all direct copies of a string are shared automatically (it's the same Tcl_Obj). For example, in the result of

proc lineindex {filename} {
    set res {}
    set F [open $filename r]
    while {![eof $F]} {
        lappend res [list $filename [tell $F]]
        gets $F ; # Skip ahead one line
    }
    close $F
    return $res
}

there is one list element for every line of the file, but all of them share the Tcl_Obj for the $filename.

NEM has interned strings for the first reason (fast comparisons). In Tcl, I'd do that like this:

set INTERN_COUNTER 0
proc intern str {
    if {![info exists ::INTERN($str)]} {
        set ::INTERN($str) [incr ::INTERN_COUNTER]
    }
    return $::INTERN($str)
}

That returns a unique integer for each string and you can then do integer tests for equality. Interned strings are used often for programming language interpreters to represent symbols/ atoms/variables names to optimise for equality checks and hash-table lookups.

JBR: A quick look at tclExecute.c shows that the byte code interpreter in fact contains the trivial optimization of string compare for comparing an instance to itself.

 "tclExecute.c" line 2457 of 6089:

if (valuePtr == value2Ptr) {
     /*
        In the pure equality case, set lengths too for
        the checks below (or we could goto beyond it).
      */
     iResult = s1len = s2len = 0;
 } else if ((valuePtr->typePtr == &tclByteArrayType)

Interning a string using method 1 above optimized string compare and memory usage.

AK: I have done Tcl-level interning of strings as a form of compression. Pathnames again, several thousand times the same ones. Think versions of the same file in a version control system, for example.

Hm ... The abstract operation behind this is "moving the replicated parts of the data into a separate place". For relational databases this is normalization, creating a second table storing the replicated data using unique strings, and referencing them in the original table through a small key.

See Also edit