05Apr03 Brian Theado - For XML, I'm guessing the title of this page is referring to one-off regular expressions, but see [3] for a paper describing shallow parsing of XML using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementation in Perl, Javascript and Flex/Lex. The Appendix also includes an interactive demo (using the Javascript implementation apparently). The demo helped me understand what they meant by "shallow parsing". For a Tcl translation, see XML Shallow Parsing with Regular Expressions.
Why are regular expressions not suited for parsing email addresses? "Regular expression to validate e-mail addresses" comments on this.A few more comments appear in "The Limits to Regular Expressions" [4] and "Regular Expressions Do Not Solve All Problems" [5], themselves descendants of Jamie Zawinski's notorious judgment [6][7] perhaps speaks for itself. [8] surely does.REs multiply, rather than solve, problems.
D. McC: OK, so what can you use instead of REs to solve, rather than multiply, problems?AM In Tcl you have a number of options, depending on what you really want to do:
- Searching for individual words - consider [lsearch]
- Searching for particularly simple patterns - consider [string match]
- Try coming up with simple REs that solve the matching problem to, say, 80 or 90% and use a second step to get rid of the "false positives"
- Use a combination of all three
- If you are trying to match text that spans multiple lines, not uncommon, turn it into one long string first, removing any unnecessary characters (like \ or \n)
CMcC It should be realised that the shallow parser in XML Shallow Parsing with Regular Expressions treats attributes and DOCTYPE as opaque - it doesn't attempt to parse them, and therefore the regexp parser doesn't parse XML, but rather a simpler language that looks a lot like XML (and is for all practical purposes largely equivalent.)Above, NEM speaks, correctly, of parsing as identical in meaning to recognising. It is important to remember however, that very few people are interested merely in answering the question is this string a valid XML document?.Most of what we mean when we speak of parsing is actually translation - we wish to take an XML DI (or a putative XML DI) and transform it into something else with which one can directly work, for example a tree. It may be possible to translate an XML DI into something without ever recognising it. That translation and translators often include a parser, or something we call a parser, does not mean that all translation is predicated on parsing.It may be that a process which does not parse XML (in that it fails to recognise that a given string does not conform to the XML grammar) nonetheless produces a coherent, isomorphic, structure- and semantic-preserving translation for those strings which happen to conform to the XML grammar.As the paper referenced in the shallow parsing page points out, this behaviour may actually be more desirable than parsing per se.Nothing I've written here should be taken to mean that I think unrestricted, undisciplined, indiscriminate, careless or naive use of regexps is a good thing. None of those adjectives describe the shallow XML parser used as a translator, so I feel the title of this page is insufficiently nuanced.NEM I disagree with the argument that you can do a meaningful translation without doing any parsing. The XML shallow parser is a parser, just not a complete one. Perhaps that is sufficient for many problems, but it is still parsing. You have to recognise something in order to perform any meaningful operation on some data. Now, you can go far with regular expressions, and as implemented in most languages they are more powerful than the regular expressions of formal language theory. However a brief look over the XML shallow parser source code to me reveals that using solely regexps for any substantial pattern matching is not a good idea, which is what this page is all about (not that I started this page). Perhaps we should reverse the question: what are considered to be the advantages to using regexps for this sort of work? Certainly not simplicity or code clarity, I'd guess.CMcC the argument is rather that one can do meaningful translation of XML DIs without parsing XML. The XML shallow parser is a complete parser, just not of XML.I don't know that the contention 'you have to recognise something in order to perform a meaningful operation' is well founded in all its generality (consider [incr x].)I think, to be fair, a look over the shallow parser needs to be balanced against a similar perusal of a pure-tcl or a C implementation. It's certainly shorter than those, which leads me to this answer to your question: It may be that a regexp based translation runs faster than a pure-tcl less regexp'd translation.NEM Well, calling the shallow parser complete seems like moving the goal posts a bit: regexps can't parse XML on their own, so let's parse something else instead (and in the process, partially tokenize some XML). I'm not sure what the [incr x] example shows: the value of $x has to be parsed in some way to ensure that it is an integer (e.g. the parsing is done by Get_IntFromObj when converting the string rep into an integer rep).Yes, the shallow parser is shorter and faster than any pure-Tcl non-regexp equivalent, I'd guess with some certainty. But that is only because a regexp compiler is built into the Tcl standard library. If you took the regexp source code along with the C source of the regexp engine, then I suspect the resulting code would be of comparable complexity to tdom + "dom parse $xml", for instance.CMcC taking NEM's response backwards: using a tcl built-in regexp compiler is no more unreasonable than using a tcl built-in xml parser (were one available.) It is an optimisation (like generating code for an ALU when one is compiling C.)Regarding the [incr x] counter-argument: the data to which $x refers is not, itself, parsed (certainly not at the tcl language level.) It is, of course, necessary to recognise tcl prior to evaluating it, but this doesn't bear on whether it is necessary to parse an XML DI prior to translating it. Incidentally, it is quite possible to evaluate tcl and produce useful results without parsing it (thus: [if {0} {{mumble}wibble} else {something_useful}])Translating something in several phases is a reasonable thing to do. If XML' (being the XML-like language recognised by the regexp) is readily recognised by a built-in facility, and if parsing it gets at the part of the XML DI you're interested in (e.g. not the DOCTYPE) then it's effective too.It is no real criticism of the regexps to state (correctly, if tacitly) that it is not XML-complete, if it is XML'-complete, and XML' is what the application is interested in.I'll some questions: if XML-completeness would reject a near-DI which was valid XML but for an ill-formed DOCTYPE, where XML'-completeness would fail to notice, or would notice in a distinct phase of processing, is that a bad thing?If, as the regexp's author suggests, XML were respecified (by the W3) in such a way as to respecify those parts of XML which differed from XML' (or some XML' which was regexpable) would that be a bad thing?Given the omnipresence of XML (and, let's be honest, XML' as badly formed XML) in the wild, is it a bad thing to have a lightweight pure tcl way to (mostly) handle it?I have no problem with the statement that regexps are often a very bad idea, or with discouraging people from thinking it's an easy thing to parse using regexps. I don't even particularly like regexps. However, when the admonition becomes an orthodoxy it is counterproductive.NEM: I don't really want to argue this further. Using the built-in regexp parser is not unreasonable, but it should be recognised that regexps are not the be-all-and-end-all of parsing. In fact, they are at the bottom rung of parsing technology, and people should be aware of that.I don't know what you mean by saying that $x is not parsed or that "it is quite possible to evaluate tcl ... without parsing it". To me, this statement is absurd. What exactly is the Tcl parser doing then? Just because you've not personally parsed something doesn't mean that it hasn't been parsed.To answer your other questions:
- I never stated that multi-stage translation was a bad thing.
- If the regexps parse XML' but not XML then it should be called an XML' parser, and not an XML parser, and no claim should be made that regexps are sufficient for parsing XML.
- Given the omnipresence of XML, is it a bad thing to have tools which only partially handle it? Yes! How many RSS feeds do I still encounter which aren't valid XML? How many software engineers still insist on rolling their own parsers/generators using regexps etc which seemed good enough at the time? Wasn't one of the goals of XML to save us from this?