Regular Expressions Are Not A Good Idea for Parsing XML, HTML, or e-mail Addresses

[1] - NEM This link is dead for me on 2 June, 2005. This [2] article from comp.lang.tcl certainly looks relevant, however.

[Wiki page on e-mail addresses]

[different meanings of "regular expressions"]

[ Perl disease]

[When REs go wrong]

Regular expression examples

05Apr03 Brian Theado - For XML, I'm guessing the title of this page is referring to one-off regular expressions, but see [3] for a paper describing shallow parsing of XML using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementation in Perl, Javascript and Flex/Lex. The Appendix also includes an interactive demo (using the Javascript implementation apparently). The demo helped me understand what they meant by "shallow parsing". For a Tcl translation, see XML Shallow Parsing with Regular Expressions.

Why are regular expressions not suited for parsing email addresses? "Regular expression to validate e-mail addresses" comments on this.

A few more comments appear in "The Limits to Regular Expressions" [4] and "Regular Expressions Do Not Solve All Problems" [5], themselves descendants of Jamie Zawinski's notorious judgment [6]

[7] perhaps speaks for itself. [8] surely does.

REs multiply, rather than solve, problems.

D. McC: OK, so what can you use instead of REs to solve, rather than multiply, problems?

AM In Tcl you have a number of options, depending on what you really want to do:

Searching for individual words - consider [lsearch]
Searching for particularly simple patterns - consider [string match]
Try coming up with simple REs that solve the matching problem to, say, 80 or 90% and use a second step to get rid of the "false positives"
Use a combination of all three
If you are trying to match text that spans multiple lines, not uncommon, turn it into one long string first, removing any unnecessary characters (like \ or \n)

That is just a handful of methods. I am sure others can come up with more methods.

DKF: For XML and HTML, use a proper parser to build a DOM tree. For email addresses, do a cheap hack that does the 99.999% of the cases seen in practice. :^)

NEM: I'm not sure if [lsearch] or [string match] would be the way to go if [regexp] wasn't good enough. The direction I'd go in would be to use one of the many parser generators available for Tcl (e.g. I've heard good things about taccle), or check out some of the tools in tcllib (look at the grammar_fa stuff by AKU). Or, you could roll your own parser using recursive descent. At some point soon, I'd like to experiment with parser combinators [9], which look great. Note that most of these techniques probably make use of regexps as part of the solution. However, regular expressions, in their most basic form, can only recognise regular grammars (see [10] for a description of the Chomsky language hierarchy), but many times what needs to be parsed is context-free or context-sensitive (XML is context-free, IIRC).

AM I use [lsearch] and [string match] for identifying lines of interest - quite often the first thing you need to do. I do not intend them as replacements for splitting up the text in smaller pieces...

CMcC It should be realised that the shallow parser in XML Shallow Parsing with Regular Expressions treats attributes and DOCTYPE as opaque - it doesn't attempt to parse them, and therefore the regexp parser doesn't parse XML, but rather a simpler language that looks a lot like XML (and is for all practical purposes largely equivalent.)

Above, NEM speaks, correctly, of parsing as identical in meaning to recognising. It is important to remember however, that very few people are interested merely in answering the question is this string a valid XML document?.

Most of what we mean when we speak of parsing is actually translation - we wish to take an XML DI (or a putative XML DI) and transform it into something else with which one can directly work, for example a tree. It may be possible to translate an XML DI into something without ever recognising it. That translation and translators often include a parser, or something we call a parser, does not mean that all translation is predicated on parsing.

It may be that a process which does not parse XML (in that it fails to recognise that a given string does not conform to the XML grammar) nonetheless produces a coherent, isomorphic, structure- and semantic-preserving translation for those strings which happen to conform to the XML grammar.

As the paper referenced in the shallow parsing page points out, this behaviour may actually be more desirable than parsing per se.

Nothing I've written here should be taken to mean that I think unrestricted, undisciplined, indiscriminate, careless or naive use of regexps is a good thing. None of those adjectives describe the shallow XML parser used as a translator, so I feel the title of this page is insufficiently nuanced.

NEM I disagree with the argument that you can do a meaningful translation without doing any parsing. The XML shallow parser is a parser, just not a complete one. Perhaps that is sufficient for many problems, but it is still parsing. You have to recognise something in order to perform any meaningful operation on some data. Now, you can go far with regular expressions, and as implemented in most languages they are more powerful than the regular expressions of formal language theory. However a brief look over the XML shallow parser source code to me reveals that using solely regexps for any substantial pattern matching is not a good idea, which is what this page is all about (not that I started this page). Perhaps we should reverse the question: what are considered to be the advantages to using regexps for this sort of work? Certainly not simplicity or code clarity, I'd guess.

CMcC the argument is rather that one can do meaningful translation of XML DIs without parsing XML. The XML shallow parser is a complete parser, just not of XML.

I don't know that the contention 'you have to recognise something in order to perform a meaningful operation' is well founded in all its generality (consider [incr x].)

I think, to be fair, a look over the shallow parser needs to be balanced against a similar perusal of a pure-tcl or a C implementation. It's certainly shorter than those, which leads me to this answer to your question: It may be that a regexp based translation runs faster than a pure-tcl less regexp'd translation.

NEM Well, calling the shallow parser complete seems like moving the goal posts a bit: regexps can't parse XML on their own, so let's parse something else instead (and in the process, partially tokenize some XML). I'm not sure what the [incr x] example shows: the value of $x has to be parsed in some way to ensure that it is an integer (e.g. the parsing is done by Get_IntFromObj when converting the string rep into an integer rep).

Yes, the shallow parser is shorter and faster than any pure-Tcl non-regexp equivalent, I'd guess with some certainty. But that is only because a regexp compiler is built into the Tcl standard library. If you took the regexp source code along with the C source of the regexp engine, then I suspect the resulting code would be of comparable complexity to tdom + "dom parse $xml", for instance.

CMcC taking NEM's response backwards: using a tcl built-in regexp compiler is no more unreasonable than using a tcl built-in xml parser (were one available.) It is an optimisation (like generating code for an ALU when one is compiling C.)

Regarding the [incr x] counter-argument: the data to which $x refers is not, itself, parsed (certainly not at the tcl language level.) It is, of course, necessary to recognise tcl prior to evaluating it, but this doesn't bear on whether it is necessary to parse an XML DI prior to translating it. Incidentally, it is quite possible to evaluate tcl and produce useful results without parsing it (thus: [if {0} {{mumble}wibble} else {something_useful}])

Translating something in several phases is a reasonable thing to do. If XML' (being the XML-like language recognised by the regexp) is readily recognised by a built-in facility, and if parsing it gets at the part of the XML DI you're interested in (e.g. not the DOCTYPE) then it's effective too.

It is no real criticism of the regexps to state (correctly, if tacitly) that it is not XML-complete, if it is XML'-complete, and XML' is what the application is interested in.

I'll some questions: if XML-completeness would reject a near-DI which was valid XML but for an ill-formed DOCTYPE, where XML'-completeness would fail to notice, or would notice in a distinct phase of processing, is that a bad thing?

If, as the regexp's author suggests, XML were respecified (by the W3) in such a way as to respecify those parts of XML which differed from XML' (or some XML' which was regexpable) would that be a bad thing?

Given the omnipresence of XML (and, let's be honest, XML' as badly formed XML) in the wild, is it a bad thing to have a lightweight pure tcl way to (mostly) handle it?

I have no problem with the statement that regexps are often a very bad idea, or with discouraging people from thinking it's an easy thing to parse using regexps. I don't even particularly like regexps. However, when the admonition becomes an orthodoxy it is counterproductive.

NEM: I don't really want to argue this further. Using the built-in regexp parser is not unreasonable, but it should be recognised that regexps are not the be-all-and-end-all of parsing. In fact, they are at the bottom rung of parsing technology, and people should be aware of that.

I don't know what you mean by saying that $x is not parsed or that "it is quite possible to evaluate tcl ... without parsing it". To me, this statement is absurd. What exactly is the Tcl parser doing then? Just because you've not personally parsed something doesn't mean that it hasn't been parsed.

To answer your other questions:

I never stated that multi-stage translation was a bad thing.
If the regexps parse XML' but not XML then it should be called an XML' parser, and not an XML parser, and no claim should be made that regexps are sufficient for parsing XML.
Given the omnipresence of XML, is it a bad thing to have tools which only partially handle it? Yes! How many RSS feeds do I still encounter which aren't valid XML? How many software engineers still insist on rolling their own parsers/generators using regexps etc which seemed good enough at the time? Wasn't one of the goals of XML to save us from this?

I would very much discourage the culture of thinking that regexps are a valid solution to many parsing problems: In reality, they are the solution to a quite a few lexing problems as part of a larger parser framework. Others have already solved the problems of parsing/generating XML -- tdom and TclXML are the results.

I don't want to rant overly against the use of regexps -- I use them very frequently myself, and they are a powerful tool. But if we want to cultivate a culture of "pick the right tool for the job", then we need to recognise that regexps are not always the right tool!

KPV The reason why I, and I suspect lots of other people, tend towards using RE to parse XML is that tdom and TclXML feel like such a big hammer for what seems like a simple problem. For example, my most recent XML problem involved extracting latitude and longitude values for a list of waypoints in a GPX file. The RE solution took just a dozen or so lines of code. It may fail on some pathological, user-contrived data but in practice it works just fine. The alternative of building a DOM tree and traversing it seems like overkill. Once you throw in cross-platform and packaging issues, the RE solution starts looking better.

Fundamentally, the real problem is that XML is overkill for lots of data that would be better off in some simpler format.

10jun05 jcw - If we had something like Parsing HTML in Tcl, in a simple, fast, and minimal way, much of the need for bigger XML solutions might be reduced. I've got a davkit implementation which would fit in tcllib just fine (it was submitted for inclusion months ago), except that even for such simple uses it can't be included because it would require either TclXML or tDOM. Can't we add some simple functionality to Tcl (C if need be) to slay this dragon?

IOW, if the Tcl core can have a hefty regexp command in there, why not some sort of tokenizer which is sufficient to deal with the rest of the task in Tcl scripts?

NEM That's an excellent idea. Parsing is such a common task that it'd be really useful to have some really powerful general-purpose parsing framework in the core (or as a package distributed with the core, like http). I think something like Parser Combinators, only much more optimised (lots of low-level parsers/scanners for e.g. Tcl-lists, numbers, etc written in C) and more Tcl-ish (the monad stuff is rather clumsy in Tcl) could fit the bill, but would take some time to develop. There are various alternatives too. As everything is a string in Tcl, we should make sure that Tcl is top-class for string processing. It's already much better than most languages, but I think we can go further.

MR Stephen Uhler's circa-1995 web browser in Tcl includes a parser procedure (HMparse_html) which calls back to a procedure you supply with info on each tag it finds etc. I use that as a starting point for the web app test suite for ProjectForum.

AK The core parsing part of this code is actually in Tcllib, under the name htmlparse. Help on making this better is welcome.