Where: From the contact
Description: Tcl script which reads an HTML document and outputs the plain
text of the document. Designed to make it relatively easy for the
user to configure how the program should mark specific HTML tags.
Updated: 10/1997
Contact: mailto:jmoss@ichips.intel.com (Joe Moss)
CM May 14th 03 - Get the source from Joe Moss's current home page: http://www.psg.com/~joem/tcl/
On the Tcl'ers Chat, Roi Dayan offers a method for stripping HTML, with an option to leave specific tags in place:
proc strip-html-ignore {text {ignore {}}} {
    # Return the tag unchanged if it matches any of the ignore patterns
    foreach i $ignore {if {[regexp -- $i $text]} {return $text}}
    # Otherwise strip it
    return ""
}
proc strip-html {html {ignore {}}} {
    # Turn every tag into a call to strip-html-ignore, then evaluate the calls
    regsub -all -- {<[^>]*>} $html "\[strip-html-ignore \[list &\] [list $ignore]\]" html
    set html [subst $html]
    return $html
}
Syntax: strip-html text [list ignore1 ignore2]
Example:
set a {<pre><a href=bla>roi<hr></a></pre><br>}
puts [strip-html $a [list <br> <a.*>]]
will output:
<a href=bla>roi<br>
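To see what subst actually evaluates, it can help to print the intermediate string that regsub builds. The sketch below reuses the example input; the variable names ignore and script are only illustrative:

```tcl
set a {<pre><a href=bla>roi<hr></a></pre><br>}
set ignore [list <br> <a.*>]
# Same regsub as in strip-html, but the result is kept for inspection
# instead of being passed to subst
regsub -all -- {<[^>]*>} $a "\[strip-html-ignore \[list &\] [list $ignore]\]" script
puts $script
```

Each tag becomes a bracketed call such as [strip-html-ignore [list <pre>] {<br> <a.*>}], while plain text like roi is left as it is. subst then replaces every call with its result.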
For large inputs this raises errors, because characters that are special to Tcl (brackets, dollar signs, semicolons and so on) end up being evaluated by subst. For big strings (like a whole page fetched with the http package), use this:
proc strip-html {html {ignore {}}} {
    # Backslash-escape the characters subst would otherwise interpret
    set m {[][\\;$]}
    regsub -all -- $m $html {\\&} html
    # Turn every tag into a call to strip-html-ignore, then evaluate the calls
    regsub -all -- {<[^>]*>} $html "\[strip-html-ignore \[list &\] [list $ignore]\]" html
    return [subst $html]
}
[Joe Moss] | [Category Application]
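Both versions above evaluate the document with subst, so they depend on every special character being escaped first. A minimal sketch of an alternative that never evaluates the input is shown below; it walks the tags with regexp -indices instead. The name strip-html-safe and the parameter name keep are hypothetical, and lassign assumes Tcl 8.5 or later:

```tcl
# Hypothetical helper: strip tags without subst, so brackets, dollar
# signs and backslashes in the input are copied through untouched.
proc strip-html-safe {html {keep {}}} {
    set out ""
    set pos 0
    # Find the character range of every tag in the input
    foreach range [regexp -all -inline -indices -- {<[^>]*>} $html] {
        lassign $range first last
        # Copy the text between the previous tag and this one
        append out [string range $html $pos [expr {$first - 1}]]
        set tag [string range $html $first $last]
        # Keep the tag only if it matches one of the patterns
        foreach pattern $keep {
            if {[regexp -- $pattern $tag]} {
                append out $tag
                break
            }
        }
        set pos [expr {$last + 1}]
    }
    # Copy whatever follows the last tag
    append out [string range $html $pos end]
    return $out
}
```

With the example input from earlier, strip-html-safe $a [list <br> <a.*>] gives <a href=bla>roi<br>, the same result as strip-html.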

