Updated 2018-09-28 12:36:12 by PeterLewerin

Splitting strings with embedded strings

See Also  edit

Regular Expressions Are Not A Good Idea for Parsing XML, HTML, or e-mail Addresses

Question  edit

Richard Suchenwirth 2001-05-31: - Robin Lauren <robin@lauren.net> wrote in comp.lang.tcl:

I want to split an argument which contains spaces within quotes into proper name=value pairs. But I can't :)

Consider this example:
set tag {body type="text/plain" title="This is my body"} 
set element [lindex $tag 0]
set attributes [lrange $tag 1 end] ;# *BZZT!* Wrong answer!

My attributes becomes the list {type="text/plain"} {title="This} {is} {my} {body"} (perhaps even with the optional backslash before the quotes), which isn't really what i had in mind.

Answer  edit

Should you try to solve this yourself anyway?

The problem statement doesn't specifically say that the strings to be split are SGML/HTML/XML attribute lists, but it certainly looks like it. A minimal solution splits the string provided above, but a flexible solution should be able to split strings that bend the rules in different ways, preferably allowing for all variations that are legal in the chosen syntax.

PYK 2014-03-03: The proper solution is to use a full XML parser like tDOM, because, as the examples below illustrate, any other solution will have holes in its coverage, as they do not take every aspect of SGML/HTML/XML attribute syntax into account.

[name redacted]: That is an important consideration when deciding how to solve just about any problem, and, yes, the full definition of attribute syntax has more rules than Tcl does, so there are certainly lots of things that can trip you up. However,

  • Such a parser doesn't accept element fragments.
  • Most attribute definitions in the wild are pretty regular.
  • While ready-made solutions to most problems exist, it is still worthwhile to discuss the basics of how to achieve such solutions.

So,

  • if the element-attribute strings are still in place in the XML or HTML document, or
  • the strings are very irregular, or
  • what you need is a professional-level solution,

something like tDOM is likely to be the best choice. If, on the other hand

  • the fragments are already extracted, and
  • are guaranteed to be regular (or can easily be made regular),

or you are simply curious about how to extract data from strings and feel like playing around with it some, you might as well try one of the following methods.

Never mind, let's just look at some solutions

(Solution work initiated by Richard Suchenwirth 2001-05-31, initial regular expression solution suggested by MG. Solutions reworked and significantly extended 2014-03-09)

If there are always exactly two attribute definitions following an element name, one simple solution is to scan the string, and then enclose the name/value pairs in sublists:
set parts [scan $tag {%s %[^ =] = "%[^"]" %[^ =] = "%[^"]"}]
# -> body type text/plain title {This is my body}
set result [list]
foreach {name value} [lrange $parts 1 end] { lappend result [list $name $value] }
set result
# -> {type text/plain} {title {This is my body}}

For a more general solution, where there can be less or more than two definitions, a regexp match might be useful:
set matches [regexp -inline -all {(\S+?)\s*=\s*"(.*?)"} $tag]
# -> type=\"text/plain\" type text/plain {title="This is my body"} title {This is my body}
set result [list]
foreach {- name value} $matches { lappend result [list $name $value] }
set result
# -> {type text/plain} {title {This is my body}}

(Note that here, the foreach command extracts three values from the list during each iteration: the first value (stored in the variable named -) is just discarded.)

The problem can also be solved using list/string manipulation commands, but then we need to make sure that we see the data in the same way as Tcl does. To a human, $tag intuitively looks like a list of three items, but according to Tcl list syntax, it has 6 items, and the second item, for example, contains two literal quotes.
llength $tag
# -> 6
lmap item $tag { format "{%s}" $item }
# -> {{body}} {{type="text/plain"}} {{title="This}} {{is}} {{my}} {{body"}}

The thing that stops Tcl from seeing the attribute list our way is the equal signs in direct connection to the quotation marks. Replace them with spaces, and we have pretty much the list we want:
set taglist [string map {= { }} $tag]
# -> body type "text/plain" title "This is my body"
llength $taglist
# -> 5
lmap item $taglist { format "{%s}" $item }
# -> {{body}} {{type}} {{text/plain}} {{title}} {{This is my body}}
set result [list]
foreach {name value} [lrange $taglist 1 end] { lappend result [list $name $value] }
set result
# -> {type text/plain} {title {This is my body}}

This is a very uncomplicated solution, but there is a problem: it takes out every equal sign in the start tag string, including any equal signs that might be a part of your attribute values. If your attribute values contain quoted-printable data, using this method isn't a very good idea.

Another solution splits the tag string into a list, not by white space but by double quotes (\x22 is just a wiki-friendly (meaning it doesn't mess with syntax highlighting) way to insert a double quote character: a \" or {"} will work in the Tcl interpreter):
set taglist2 [split $tag \x22]
# -> {body type=} text/plain { title=} {This is my body} {}

Obviously, the result needs a little more processing:

  1. the element name is joined up with the first attribute name,
  2. the equal sign stays attached to the attribute name,
  3. the second (and third, etc) attribute name is preceded by leftover whitespace, and
  4. there is an empty element which resulted from splitting at the last double quote before the end of the string.

The first three problems are easily dealt with (a string consisting of a space and some non-space characters can be split into a list with an empty first item and the non-space substring as the second element):
string trimright [lindex [split {body type=}] 1] =
# -> type
string trimright [lindex [split { title=}] 1] =
# -> title

and the fourth problem can be solved by breaking out of the loop if any attribute name is the empty string:
foreach {name value} $taglist2 {
    if {$name eq {}} break
    lappend result [list [string trimright [lindex [split $name] 1] =] $value]
}
set result
# -> {type text/plain} {title {This is my body}}

All of the above solutions are minimal in the sense that while they do process the provided string properly, they may not be able to process other strings that must be considered regular attribute lists, e.g.

  1. attribute lists with more (or less) than two attributes
  2. empty attribute values (with foo="" notation, HTML syntax allows another way to specify empty attribute values)
  3. attribute expressions with spaces between the equal sign and the attribute name and/or attribute value

The table shows which methods were able to handle which cases:
scanregexpstring mapsplit
1.
2.
3.

This means that the regexp, the string map, and the split methods are all flexible enough to handle just about any case of "split[ing] an argument which contains spaces within quotes into proper name=value pairs" (keeping in mind the equal-sign problem with the string map method). But can they handle full HTML attribute syntax? Not even close. This syntax has different rules for which characters may legally appear inside tag names, attribute names, and attribute values. Empty attributes can be written as just an attribute name, without any equal sign or (empty) value in quotes. Attribute values can be unquoted. Attribute values can be single-quoted. (XML attribute syntax is a little less complex.)

So what does it take to scan an HTML start tag?

It takes something like this. Note: this is a hand-made scanner, not a real industrial-grade scanner, but it does handle most of HTML's attribute syntax, specifically HTML5 (the part that is missing is the "Shorttag" rules (except for the unquoted attribute values), which is a bit of a mess and not used very often). When putting together this example, I debated with myself whether to support HTML character references / entities (the &amp; strings). The scanner does support them, but almost half the state space is used to scan those.

This is a basic table-driven scanner. It works by looking at one character at a time, classifying it into one of the different character classes (they appear as subkeys in the table below, symbolized by a pair of lower-case letters) and then cross-referencing the current state (one of the S00 -- S28 major keys in the table) with that character class to get a new state (if the character class is legal at this point in the string) or ERR (if it is illegal). The special character class ei (for end of input) can lead to the state ACC, or accepted string.

If the scanner accepts the string, it returns -1. If an error occurred, it returns the character index where the error happened. If it returns a value greater than the last index in the string, it was expecting more text to complete a token. The command sets a variable called result in the caller's stack frame. I also added a convenience command, showscan, that either prints an error message, or returns a list consisting of the tag name and a dictionary containing the attribute names and values.)
set stateMatrix {
    S00 {ei ERR ex S01 af S01 gz S01 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S01 {ei ACC ex S01 af S01 gz S01 nu S01 sp S02 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S02 {ei ACC ex S04 af S04 gz S04 nu S04 sp S02 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am S04 ha S04 sc S04 pr S04}
    S03 {ei ACC ex ERR af ERR gz ERR nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S04 {ei ACC ex S04 af S04 gz S04 nu S04 sp S05 dq ERR sq ERR gt ERR lt ERR sl ERR eq S06 bt ERR am S04 ha S04 sc S04 pr S04}
    S05 {ei ACC ex S04 af S04 gz S04 nu S04 sp S05 dq ERR sq ERR gt ERR lt ERR sl ERR eq S06 bt ERR am S04 ha S04 sc S04 pr S04}
    S06 {ei ERR ex S07 af S07 gz S07 nu S07 sp S06 dq S09 sq S12 gt ERR lt ERR sl ERR eq ERR bt ERR am S07 ha S07 sc S07 pr S07}
    S07 {ei ACC ex S07 af S07 gz S07 nu S07 sp S08 dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am S07 ha S07 sc S07 pr S07}
    S08 {ei ACC ex S04 af S04 gz S04 nu S04 sp S08 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am S04 ha S04 sc S04 pr S04}
    S09 {ei ERR ex S10 af S10 gz S10 nu S10 sp S10 dq S11 sq S10 gt S10 lt S10 sl S10 eq S10 bt S10 am S15 ha S10 sc S10 pr S10}
    S10 {ei ERR ex S10 af S10 gz S10 nu S10 sp S10 dq S11 sq S10 gt S10 lt S10 sl S10 eq S10 bt S10 am S15 ha S10 sc S10 pr S10}
    S11 {ei ACC ex ERR af ERR gz ERR nu ERR sp S08 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S12 {ei ERR ex S13 af S13 gz S13 nu S13 sp S13 dq ERR sq S13 gt S13 lt S13 sl S13 eq S13 bt S13 am S22 ha S13 sc S13 pr S13}
    S13 {ei ERR ex S13 af S13 gz S13 nu S13 sp S13 dq S13 sq S14 gt S13 lt S13 sl S13 eq S13 bt S13 am S22 ha S13 sc S13 pr S13}
    S14 {ei ACC ex ERR af ERR gz ERR nu ERR sp S08 dq ERR sq ERR gt ERR lt ERR sl S03 eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S15 {ei ERR ex S20 af S20 gz S20 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha S16 sc ERR pr ERR}
    S16 {ei ERR ex S18 af ERR gz ERR nu S17 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S17 {ei ERR ex ERR af ERR gz ERR nu S17 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S21 pr ERR}
    S18 {ei ERR ex ERR af S19 gz ERR nu S19 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S19 {ei ERR ex ERR af S19 gz ERR nu S19 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S21 pr ERR}
    S20 {ei ERR ex S20 af S20 gz S20 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S21 pr ERR}
    S21 {ei ERR ex S10 af S10 gz S10 nu S10 sp S10 dq S11 sq S10 gt S10 lt S10 sl S10 eq S10 bt S10 am S15 ha ERR sc ERR pr S10}
    S22 {ei ERR ex S27 af S27 gz S27 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha S23 sc ERR pr ERR}
    S23 {ei ERR ex S25 af ERR gz ERR nu S24 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S24 {ei ERR ex ERR af ERR gz ERR nu S24 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S28 pr ERR}
    S25 {ei ERR ex ERR af S26 gz ERR nu S26 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc ERR pr ERR}
    S26 {ei ERR ex ERR af S26 gz ERR nu S26 sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S28 pr ERR}
    S27 {ei ERR ex S27 af S27 gz S27 nu ERR sp ERR dq ERR sq ERR gt ERR lt ERR sl ERR eq ERR bt ERR am ERR ha ERR sc S28 pr ERR}
    S28 {ei ERR ex S13 af S13 gz S13 nu S13 sp S13 dq S13 sq S14 gt S13 lt S13 sl S13 eq S13 bt S13 am S22 ha ERR sc ERR pr S13}
}

proc getCharacterClass char {
    switch -regexp -nocase -- $char {
        {^$}          { return ei }
        {[a-f]}       { return af }
        x             { return ex }
        {[g-z]}       { return gz }
        {[0-9]}       { return nu }
        {[ \t\n\f\r]} { return sp }
        \x22          { return dq }
        '             { return sq }
        >             { return gt }
        <             { return lt }
        /             { return sl }
        =             { return eq }
        `             { return bt }
        &             { return am }
        {#}           { return ha }
        {;}           { return sc }
        {[[:print:]]} { return pr }
        default       { return -- }
    }
}

proc getNextState {cclass state} {
    if {[dict exists $::stateMatrix $state $cclass]} {
        dict get $::stateMatrix $state $cclass
    } else {
        return ERR
    }
}

proc storetoken {} {
    upvar 1 token token result result
    if {[string length $token]} {
        lappend result $token
        set token {}
    }
}

proc adjustlength {} {
    upvar 1 result result
    if {!([llength $result] % 2)} {
        lappend result {}
    }
}

proc scanStartTag input {
    set state S00
    set ci 0
    set token {}

    upvar 1 result result
    set result {}

    while true {
        set c [string index $input $ci]
        set cclass [getCharacterClass $c]
        set state [getNextState $cclass $state]
        switch -- $state {
            ERR {
                storetoken
                adjustlength
                return $ci
            }
            ACC {
                storetoken
                adjustlength
                return -1
            }
            S00 - S03 - S09 -
            S12 {
            }
            S01 - S07 - S10 - S13 - S15 -
            S16 - S17 - S18 - S19 - S20 -
            S21 - S22 - S23 - S24 - S25 -
            S26 - S27 -
            S28 {
                append token $c
            }
            S02 - S05 - S06 - S08 - S11 -
            S14 {
                storetoken
            }
            S04 {
                adjustlength
                append token $c
            }
            default {
                error "Unknown state $state"
            }
        }
        incr ci
    }
}

proc showscan input {
    set rv [scanStartTag $input]
    if {$rv < 0} {
        list [lindex $result 0] [dict create {*}[lrange $result 1 end]]
    } else {
        puts stderr $input
        puts stderr [format %*s $rv ^]
        puts stderr "Unspecified syntax error at #$rv"
    }
}

AMG: The Wibble implementation contains regexp-based code to solve a similar problem. Search for "::wibble::delist" on that page, or refer to the following code snippet:
# Decode header list encoding.
proc ::wibble::delist {separator str} {
    regexp -all -inline [dict get {
semicolon {(?:[^;"=]+=)?(?:[Ww]/)?"(?:[^\\"]|\\.)*"|\((?:[^\\()]|\\.)*\)|[^;]+}
comma     {(?:[^,"=]+=)?(?:[Ww]/)?"(?:[^\\"]|\\.)*"|\((?:[^\\()]|\\.)*\)|[^,]+}
semicomma {(?:[^;,"=]+=)?"(?:[^\\"]|\\.)*"|\((?:[^\\()]|\\.)*\)|[^;,]+}
space     {"(?:[^\\"]|\\.)*"|\((?:[^\\()]|\\.)*\)|[^"()\\\s]+}
    } $separator] $str
}

The $separator argument primarily indicates which separator or separators are acceptable between elements: semicolons, commas, either semicolons or commas, or whitespace. Elements can be surrounded by double quotes or parentheses, in which case semicolons, commas, or what-have-you are ignored inside the quoting characters. Quoting characters themselves can be included verbatim by being preceded by a backslash. Elements can be preceded by an attribute name (ending with equal sign) and/or a weak tag ("w/" or "W/"), both of which come before the quotes.

The core of these regular expressions is: "(?:[^\\"]|\\.)*" . This matches an opening quote, zero or more atoms, then the closing quote. By "atom" I mean either (1) one character that's not a backslash and not a quote, or (2) one backslash followed by any one character at all. This is the magical incantation you will want to incorporate into your own regular expressions which handle double quotes and backslashes.

AM Also see: Splitting a string on arbitrary substrings

aspect: above it says "this is a hand-made scanner". Can you talk a bit about how that scanner was made? I assume it didn't involve writing that transition matrix by hand, which hints at some interesting code generation.

PYK: I'm curious about this as well.

[name redacted]: By "hand-made" I mean that I didn't use any of the usual scanner generator tools, because I wanted the code to be readable (the output of a scanner generator is usually more machine-oriented than human-oriented) and I didn't care about efficiency.

Still, I sort of used such a tool, only it didn't exist yet (let's call it Imaginary ad-hoc Scanner Generator Tool, IahSGT). Like most such tools, IahSGT has a definition language. Instead of the usual C-like syntax, it has Tcl-like syntax. The first step was to translate the HTML5 start tag definitions to statements in IahSGT's definition language, which I made up as I went along and then reworked some times to make it convenient and easy to read (not really necessary since I was going to throw it away later, but I couldn't help myself). The completed DL script told me what the scanner would have to be able to do, in terms of state transitions, input classification, and basic output. A scanner's output is typically very straightforward, like an alternating list of token classes and token strings. My scanner would have to be a little more sophisticated, producing a result in the same format as in the earlier examples. That meant that the output had to be (dummy, to begin with) executable primitives. When running the DL script through IahSGT, it dynamically created a scanner whose transition matrix, input dictionary, and scanner engine were all transient and hidden from sight. A test harness fed HTML fragments into this dynamic scanner and checked the resulting list of output primitives. I went through some iterations of refining the output dictionary and the generator engine until I was satisfied that the output was sufficiently simple and that everything was being scanned correctly. At that point I changed the generator engine to instead dump the matrix (as the stateMatrix structure), the input dictionary (as the getCharacterClass command definition), and the basic scanner engine (similar to what would later become the scanStartTag command). Then I wrote a real scanner engine based on this that included real primitives (the list of primitives eventually boiled down to storetoken, adjustlength, append and return). The test harness verified that it was correct, but the first version was unnecessarily complex. I rewrote it a bit some days later (adding the ei input class, among other things), and the result is what you see here. I don't know if this description helps any; it's all simpler than it sounds, really. Tcl is excellently convenient for building domain-specific languages and Code Generation engines while allowing for informal specifications and iterative development.

aspect: Great story, thanks for sharing! Pity the intermediate code didn't get published. While I understand it was written to be thrown away (and that's part of the charm of the tale!), I think illustrations of this sort of development style deserve to be highlighted. Highly [iterative development] with a heavy dose of code generation and [domain-specific languages] is a beautiful process in a language like Tcl, and has the potential to inspire people who (coming from less dynamic languages, or more orthodox education) might not have seen anything like it before. I have dreams of a "Tcl-craft anecdotes" blog for brief stories such as this ;-).

[name redacted]: My pleasure :) Tcl is radically different from most languages, and developing Tcl code can be exhilarating - but apparently also utterly baffling for beginners and traditionalists. How do the most popular languages compare to Tcl? Well, C is a dirt bike, Java (like C++/C#/Objective C) is a humvee; Tcl is a helicopter. When I think about why Tcl isn't popular, I suppose most people (correctly) imagine it would be very hard to drive it cross-country.