TAX: A Tiny API for XML

NB: TAX Revisited is probably a better guide to this code if you want to understand the exec calls etc.

TAX was inspired by Stephen Uhler's HTML parser in 10 lines. In fact, the code is almost exactly the same. Just a couple of extra bells and whistles.

TAX, the Tiny API for XML, is vaguely similar to SAX in that for both TAX & SAX, XML is handled by defining a handler for tags. Otherwise, TAX & SAX have more differences than similarities. The most important difference is that with TAX, both the XML and the processed XML reside in memory, so TAX makes inefficient use of memory. For this reason, it's best for small XML files. In contrast, SAX is an event-driven parser that operates on a stream, so the XML doesn't have to all be loaded into memory. (Of course, another important difference is that SAX is a mature, full-featured, well-documented, and well-supported API.)

As with Stephen Uhler's gem, TAX takes an XML file and converts it into a Tcl script. Tags become procs. The XML is then executed by eval'ing the script.

Here's the essential code:

EKB Somebody "cleaned this up" to the point where it no longer works! The numerous backslashes are hairy and distracting, but they are necessary. The only way to avoid them is to put the regexps in braces, but then you can't do variable substitution, so quotation marks are a must, and that means a lot of backslashes. I've reverted to the original

    ############################################################
    #
    # Based heavily on Stephen Uhler's HTML parser in 10 lines
    # Modified by Eric Kemp-Benedict for XML
    #
    # Turn XML into TCL commands
    #   xml     A string containing an html document
    #   cmd     A command to run for each html tag found
    #   start   The name of the dummy html start/stop tags
    #
    # Namespace "tax" stands for "Tiny API for XML"
    #

    namespace eval tax {}
    
    proc tax::parse {cmd xml {start docstart}} {
        regsub -all \{ $xml {\&ob;} xml
        regsub -all \} $xml {\&cb;} xml
        set exp {<(/?)([^\s/>]+)\s*([^/>]*)(/?)>}
        set sub "\}\n$cmd {\\2} \[expr \{{\\1} ne \"\"\}\] \[expr \{{\\4} ne \"\"\}\] \
            \[regsub -all -- \{\\s+|(\\s*=\\s*)\} {\\3} \" \"\] \{"
        regsub -all $exp $xml $sub xml
        eval "$cmd {$start} 0 0 {} \{ $xml \}"
        eval "$cmd {$start} 1 1 {} {}"
    }

To use it, create a parser command, cmd, that will handle any tag found in the string xml. The parser calls cmd in the following way:

 cmd tag cl selfcl props body

where

tag is the tag (e.g., p, br, h1, etc. from HTML) or the special tag "docstart"
cl is a boolean saying if this is a closing tag (e.g., like </p>)
selfcl is a boolean saying if this is a self-closing tag (e.g., <br/> for XHTML)
props is a list of name/value pairs that can be passed to an array using array set
body is text following the tag that is not enclosed in a tag (e.g., for <p>My text</p>, "My text" is the body)

Here's an example of use (that also uses snit to build the parser -- there's one snit method for each tag).

EKB I reverted this code to the original. This should run.

    package require snit

    ############################################################
    #
    # Based heavily on Stephen Uhler's HTML parser in 10 lines
    # Modified by Eric Kemp-Benedict for XML
    #
    # Turn XML into TCL commands
    #   xml     A string containing an html document
    #   cmd     A command to run for each html tag found
    #   start   The name of the dummy html start/stop tags
    #
    # Namespace "tax" stands for "Tiny API for XML"
    #

    namespace eval tax {}
    
    proc tax::parse {cmd xml {start docstart}} {
        regsub -all \{ $xml {\&ob;} xml
        regsub -all \} $xml {\&cb;} xml
        set exp {<(/?)([^\s/>]+)\s*([^/>]*)(/?)>}
        set sub "\}\n$cmd {\\2} \[expr \{{\\1} ne \"\"\}\] \[expr \{{\\4} ne \"\"\}\] \
            \[regsub -all -- \{\\s+|(\\s*=\\s*)\} {\\3} \" \"\] \{"
        regsub -all $exp $xml $sub xml
        eval "$cmd {$start} 0 0 {} \{ $xml \}"
        eval "$cmd {$start} 1 1 {} {}"
    }

   snit::type parser {
        proc compactws {s} {
            return [regsub -all -- {\s+} [string trim $s] " "]
        }
        method docstart {cl args} {
            if $cl {
                puts "...End document"
            } else {
                puts "Start document..."
            }
        }
        method para {cl selfcl props body} {
            array set temp $props
            if {!$cl} {
                set outstring [compactws $body]
                if [info exists temp(indent)] {
                    set outstring "[string repeat { } $temp(indent)]$outstring"
                }
                puts $outstring
            }
        }
        method meta {cl selfcl props body} {
            array set temp $props
            foreach item [array names temp] {
                puts "[string totitle $item]: $temp($item)"
            }
            if {!$selfcl} {
                puts [compactws $body]
            } else {
                puts ""
            }
        }
   }

   parser myparser

   tax::parse myparser {
    <meta author="Anne Onymous"/>
    <meta>
        Composed in haste for purposes of demonstration.
    </meta>
    <para indent="3">
      This is an indented paragraph. Only the first line
      is indented, which you can tell if the paragraph goes
      on long enough.
    </para>
    <para>
      This is an ordinary paragraph. No line is indented. Not
      one. None at all, which you can tell if the paragraph
      goes on long enough.
    </para>
   }

It gives this output:

 Start document...

 Author: Anne Onymous

 Composed in haste for purposes of demonstration.

    This is an indented paragraph. Only the first line is indented, which you can tell if the paragraph goes on long enough.
 This is an ordinary paragraph. No line is indented. Not one. None at all, which you can tell if the paragraph goes on long enough.

 ...End document

EKB wrote this, right? EKB Yes, I admit it!

EF A slightly revisited version of this, with support for accessing the XML tag tree from within the callback is available at TAX Revisited.

EKB - I looked at TAX Revisited and it's nifty! I was just keeping track of state with my own variables; it's nice to have it done for you. I also agree that handling self-closing tags separately from normal tags is awkward: <tag/> and <tag></tag> should look identical.

EF I have made the revisited implementation [1] part of the TIL. I hope that you don't mind. The source makes full mention to the Wiki reference.

EKB I don't mind at all. Great!

replace those regsubs with string map and you'll probably see a performance boost...

EKB Are they all replaceable? It looks like:

        regsub -all { $xml {&ob;} xml
        regsub -all } $xml {&cb;} xml

can be replaced by

        set xml [string map {{ &ob; } \%cb;} $xml]

but I think the others have to be regexps.

DAG Just a note: actual SAX implementations also gulp the whole XML stream before processing, since are node oriented, and being the XML object always a node, this is read wholly before beginning parsing. StAX, on other end, is a pull-based parser, which is more stream oriented, and does note require to read the whole node before parsing it. I found it much better than SAX.

JE That's incorrect: SAX is a streaming API, the whole point of SAX is that the implementation doesn't need to parse the whole document up-front. Perhaps you were thinking of DOM?

EF While working on a Tcl interface to the Flickr API, I discovered that the parser above would fail when some attributes in an XML tag contained forward slashes. I have been peeking into the code from the htmlparse module and found out the following solution. The solution consists in leaving off the trailing slash of empty tags (those ending with a slash and a less than!) and returning the boolean telling if tags are self closing by testing the presence of the trailing slash in the substitution string instead. Since I also experimented problems with the replacement of the equal sign by spaces (this should """not""" been done within attribute values!), I ended up with the following code. Please try and tell me if these improvements are worth it. EKB Seems worth it to me! It's easy to imagine a tag with forward slashes (for example, a URL).

Roy Keene This seems to break for data that has 'CDATA' entries. For example:

  <url><![CDATA[http://example.com]]></url>

gets parsed really poorly. The sequence

 <![...]>

needs to be accounted for somehow, more generally. (This could be done easily if regsub supported callbacks to determine the replacement text from the source pattern...)

EKB See below for a solution to this! (Only CDATA, not generic <!...>.)

 # ::tax::__cleanprops -- Clean parsed XML properties
 #
 #        This command cleans parsed XML properties by removing the
 #        trailing slash and replacing equals by spaces so as to produce
 #        a list that is suitable for an array set command.
 #
 # Arguments:
 #        props        Parsed XML properties
 #
 # Results:
 #        Return an event list that is suitable for an array set
 #
 # Side Effects:
 #        None.
 proc ::tax::__cleanprops { props } {
     set name {([A-Za-z_:]|[^\x00-\x7F])([A-Za-z0-9_:.-]|[^\x00-\x7F])*}
     set attval {"[^"]*"|'[^']*'|\w}; # "... Makes emacs happy
     return [regsub -all -- "($name)\\s*=\\s*($attval)" \
                 [regsub "/$" $props ""] "\\1 \\4"]
 }
 
 
 # ::tax::parse -- Low-level 10 lines magic parser
 #
 #        This procedure is the core of the tiny XML parser and does its
 #        job in 10 lines of "hairy" code.  The command will call the
 #        command passed as an argument for each XML tag that is found
 #        in the XML code passed as an argument.  Error checking is less
 #        than minimum!  The command will be called with the following
 #        respective arguments: name of the tag, boolean telling whether
 #        it is a closing tag or not, boolean telling whether it is a
 #        self-closing tag or not, list of property (array set-style)
 #        and body of tag, if available.
 #
 # Arguments:
 #        cmd        Command to call for each tag found.
 #        xml        String containing the XML to be parsed.
 #        start        Name of the pseudo tag marking the beginning/ending of document
 #
 # Results:
 #        None.
 #
 # Side Effects:
 #        None.
 proc ::tax::parse {cmd xml {start docstart}} {
     regsub -all \{ $xml {\&ob;} xml
     regsub -all \} $xml {\&cb;} xml
     set exp {<(/?)([^\s/>]+)\s*([^>]*)>}
     set sub "\}\n$cmd {\\2} \[expr \{{\\1} ne \"\"\}\] \[regexp \{/$\} {\\3}\] \
              \[::tax::__cleanprops \{\\3\}\] \{"
     regsub -all $exp $xml $sub xml
     eval "$cmd {$start} 0 0 {} \{$xml\}"
     eval "$cmd {$start} 1 0 {} {}"
 }

EKB Handling CDATA Sections

Here's a version of the code + sample that handles CDATA sections (see Roy Keene's comment above). Since the idea with CDATA is that it be taken literally, the strategy is to get it out of the way before the parsing is done using tax::preparse and replace the text with a token. Then at the end, the tokens are replaced with the original text with tax::postparse.

This being said, I should emphasize that my original thinking with TAX is that it would be used to handle parsing of XML config files and similar small tasks, so (as originally conceived) it wouldn't have all the bells and whistles that a full XML-compliant parser would have. But CDATA can certainly end up even in a small application-specific XML file, and it was an interesting challenge!

    package require snit

    ############################################################
    #
    # Based heavily on Stephen Uhler's HTML parser in 10 lines
    # Modified by Eric Kemp-Benedict for XML
    #
    # Turn XML into TCL commands
    #   xml     A string containing an html document
    #   cmd     A command to run for each html tag found
    #   start   The name of the dummy html start/stop tags
    #
    # Namespace "tax" stands for "Tiny API for XML"
    #

    namespace eval tax {
        variable cdata
    }
    
    proc tax::preparse {xml} {
        variable cdata
        
        set cdata {}
        
        # Collect cdata in namespace variable
        # Use a non-greedy match on the body
        set re {<\!\[CDATA\[(.+?)\]\]>}
        set n 0
        while {[regexp -- $re $xml -> t] == 1} {
            lappend cdata $t
            regsub -- $re $xml "TAX:CDATA$n" xml
            incr n
        }
        return $xml
    }
    
    proc tax::postparse {xml} {
        variable cdata
        
        set n 0
        foreach t $cdata {
            regsub -- "TAX:CDATA$n" $xml $t xml
            incr n
        }
        return $xml
    }
    
    proc tax::parse {cmd xml {start docstart}} {
        set xml [preparse $xml]
        
        regsub -all \{ $xml {\&ob;} xml
        regsub -all \} $xml {\&cb;} xml
        set exp {<(/?)([^\s/>]+)\s*([^/>]*)(/?)>}
        set sub "\}\n$cmd {\\2} \[expr \{{\\1} ne \"\"\}\] \[expr \{{\\4} ne \"\"\}\] \
            \[regsub -all -- \{\\s+|(\\s*=\\s*)\} {\\3} \" \"\] \{"
        regsub -all $exp $xml $sub xml
        
        set xml [postparse $xml]
        
        eval "$cmd {$start} 0 0 {} \{ $xml \}"
        eval "$cmd {$start} 1 1 {} {}"
    }

   snit::type parser {
        proc compactws {s} {
            return [regsub -all -- {\s+} [string trim $s] " "]
        }
        method docstart {cl args} {
            if $cl {
                puts "...End document"
            } else {
                puts "Start document..."
            }
        }
        method para {cl selfcl props body} {
            array set temp $props
            if {!$cl} {
                set outstring [compactws $body]
                if [info exists temp(indent)] {
                    set outstring "[string repeat { } $temp(indent)]$outstring"
                }
                puts $outstring
            }
        }
        method meta {cl selfcl props body} {
            array set temp $props
            foreach item [array names temp] {
                puts "[string totitle $item]: $temp($item)"
            }
            if {!$selfcl} {
                puts [compactws $body]
            } else {
                puts ""
            }
        }
   }

   parser myparser

   tax::parse myparser {
    <meta author="Anne Onymous"/>
    <meta>
        Composed in haste for purposes of demonstration.
    </meta>
    <para indent="3">
      This is an indented paragraph. Only the first line
      is indented, which you can tell if the paragraph goes
      on long enough. <![CDATA[<exampletag "Hi!">]]>
      <![CDATA[\example\path]]>
    </para>
    <para>
      This is an ordinary paragraph. No line is indented. Not
      one. None at all, which you can tell if the paragraph
      goes on long enough.
    </para>
   }

jbr - 2011-06-08 xml2list thingy

Here is a little code to convert xml to a tcl list, cribbed from and in the spirit of ::tax::parse.

proc ::tax::xml2list { xml } {
    regsub -all \{ $xml {\&ob;} xml
    regsub -all \} $xml {\&cb;} xml

    set xexp  {<\?([^\s/>]+)\s*([^>]*)\?>}
    set oexp  {<([^\s/>]+)\s*([^>]*)\??>}
    set cexp {</([^\s/>]+)\s*([^>]*)>}

    regsub -all {\[}  $xml \\\[ xml

    regsub      $xexp $xml { \1 { [::tax::__cleanprops {\2}] } \{}      xml
    regsub -all $oexp $xml { \1 { [::tax::__cleanprops {\2}] } \{}      xml
    regsub -all $cexp $xml \}                                           xml

    return "[subst $xml]\}"
}