html

HTML, or HyperText Markup Language, is a markup language used on the World-Wide Web.

Parsing Tools

Tcllib html: a module for generating HTML
htmlparse: tools to parse HTML
tkHTML: an extension that parses and renders HTML, compiled for use without Tk
tcltidy: a wrapper to Tidy
tkhtml3: the successor to tkHTML
tDOM's XPath-oriented parser: can be used to manipulate HTML
TclXML: includes xmlgen for generating HTML or XML
Tclgumbo: An interface to the Gumbo HTML5 parsing library

Generation Tools

html form generator, by CMcC: Generate HTML forms from Tcl lists.
MajaMaja: structures and lays out a static collection of HTML pages, arranging a wide variety of materials
Wub: includes a utility for structured HTML tag generation
Wiki format to HTML

Description

For extracting data from HTML, it's generally more robust to parse the page into some document model, perhaps using tDOM, and then use XPath to find the data, than to hack at it with regular expressions.

If the task is to pull some data out of an HTML page, I'm a strong believer in the 'parse the HTML page into a tree and query that tree' approach. For real-life problems, I claim that this approach is much simpler and easier to maintain than any regexp approach (and you will certainly have to maintain such a thing, because the layout of HTML pages tends to change frequently). Sure, you have to learn another query language, XPath in this case. But if you are really in the web business, chances are you will have to learn XPath anyway.
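
As a rough illustration of the parse-then-query approach described above, here is a minimal sketch using tDOM. The helper name extractText, the sample markup, and the XPath expression are all made up for this example; it assumes tDOM is installed and Tcl 8.6 or later (for try/finally).

```tcl
package require tdom

# Hypothetical helper (not part of tDOM): parse an HTML string into a DOM
# tree and return the text of every node matching the given XPath expression.
proc extractText {html xpath} {
    # -html selects tDOM's HTML parser instead of the strict XML one
    set doc [dom parse -html $html]
    try {
        set root [$doc documentElement]
        set result {}
        foreach node [$root selectNodes $xpath] {
            lappend result [$node text]
        }
        return $result
    } finally {
        # Release the DOM tree whether or not the query succeeded
        $doc delete
    }
}

# Made-up sample page: a small price table
set sample {<html><body><table>
<tr><td class="name">Widget</td><td class="price">19.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">4.50</td></tr>
</table></body></html>}

# Print the text of each cell marked with class="price", one per line
puts [join [extractText $sample {//td[@class="price"]}] \n]
```

If the page layout changes, only the XPath expression should need adjusting, which is the maintenance argument made above.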
