HTML, or HyperText Markup Language, is a markup language used on the World-Wide Web.
Parsing Tools
- Tcllib html
- a module for generating html
- htmlparse
- tools to parse html
- tkHTML
- an extension that parses and renders HTML, compiled for use without Tk
- tcltidy
- a wrapper to Tidy
- tkhtml3
- the successor to tkHTML
- tDOM's XPath-oriented parser
- can be used to manipulate HTML
- TclXML
- includes xmlgen for generating HTML or XML
- Tclgumbo
- An interface to the Gumbo HTML5 parsing library
Generation Tools
- html form generator, by CMcC
- Generate HTML forms from Tcl lists.
- MajaMaja
- structures and lays out a static collection of HTML pages, arranging a wide variety of materials
- Wub
- includes a utility for structured HTML tag generation
- Wiki format to HTML
See Also
- HTML widgets
- discusses widgets that render HTML into a visual representation.
- Web scraping
- august html editor
- url-encoding
- html2text
Description
For extracting data from HTML, it is generally more robust to parse the page into a document model, perhaps using tDOM, and then use XPath to find the data, than to hack at the markup with regular expressions.
If the task is to pull some data out of an HTML page, I'm indeed a strong believer in the 'parse the HTML page into a tree and query that tree' approach. For real-life problems, I claim that this approach is much simpler and easier to maintain than any regexp approach - and for sure, you have to maintain such a thing, because the layout of HTML pages tends to change frequently. Sure, you have to learn another query language, XPath in this case. But if you are really in the web business, chances are you will have to learn XPath anyway.
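
The parse-then-query approach described above might look roughly like the following with tDOM. This is only a sketch: the sample page, the example.com URLs, and the proc name extractLinks are invented for illustration, and it assumes the tdom package is installed and Tcl 8.6 or later (for try/finally).

```tcl
package require tdom

# Hypothetical sample page, used only for illustration.
set html {
<html><body>
<ul>
  <li><a href="https://example.com/one">One</a></li>
  <li><a href="https://example.com/two">Two</a></li>
  <li><a name="anchor-only">No link here</a></li>
</ul>
</body></html>
}

# Hypothetical helper: returns a flat list of href/text pairs
# for every <a> element that carries an href attribute.
proc extractLinks {html} {
    # -html selects tDOM's tolerant HTML parser instead of the strict XML one.
    set doc [dom parse -html $html]
    set links {}
    try {
        set root [$doc documentElement]
        # The XPath expression picks the elements; no regexp over the raw markup.
        foreach node [$root selectNodes {//a[@href]}] {
            lappend links [$node getAttribute href] [$node text]
        }
    } finally {
        # DOM trees are not freed automatically; delete the document when done.
        $doc delete
    }
    return $links
}

foreach {href text} [extractLinks $html] {
    puts "$text -> $href"
}
```

If the page layout changes, typically only the XPath expression needs to be adjusted, which is the maintenance argument made above.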