JM 4 Dec 2012 - Here is a minimal example of web scraping using htmlparse.
As I am an RS fan, I am getting a list of all his recent projects.
- This is unfinished code, meant only to show the overall mechanism.
- Notice that I am getting just one link per bullet, so, for example, I am missing the link for A pocket Wiki, which is the second link on the 5th bullet. See how only Profiling with execution traces is listed.
- Also notice the error message "node "" does not exist in tree "t"" when there is no link on the bullet, as in "simplicite".
Getting every link in each bullet could be a good exercise for the reader.
As a side note, I used LemonTree branch to easily find the location of the bulleted list block that I am parsing.
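The "location" here is a list of child indices counted from the root of the parsed tree. As a rough illustration of how such a path is resolved, the sketch below builds a tiny tree by hand and follows a path through it. The node names (a, b, c, d) and the proc name follow-path are made up for this illustration and do not come from the page being scraped.

```tcl
package require struct::tree

# Hypothetical helper: follow a list of child indices starting at
# $startNode. Returns "" as soon as the path leaves the tree.
proc follow-path {tree startNode path} {
    set node $startNode
    foreach idx $path {
        if {$node eq ""} {
            break
        }
        set node [lindex [$tree children $node] $idx]
    }
    return $node
}

set t [::struct::tree]
# Build: root -> {a b}, b -> {c d}
$t insert root end a b
$t insert b end c d

# Second child of root, then its first child.
set found [follow-path $t root {1 0}]
# Index 5 does not exist below b, so the result is empty.
set missing [follow-path $t root {1 5}]

puts "found: '$found'"
puts "missing: '$missing'"
$t destroy
```

A path found this way is tied to the exact layout of the page at the time it was inspected, so it may stop working when the page changes.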
Ways of accessing the data
Walking the tree
package require struct
package require htmlparse
package require http
namespace eval ::scraper {
    # The tag at $startNodePath should be a <ul> with its children having the
    # structure of <li><a href="...">...</a></li>.
    proc parse-list-of-links {url startNodePath} {
        set documentTree [::struct::tree]
        set conn [::http::geturl $url]
        set html [::http::data $conn]
        ::http::cleanup $conn
        htmlparse::2tree $html $documentTree
        htmlparse::removeVisualFluff $documentTree
        htmlparse::removeFormDefs $documentTree
        set base [walk $documentTree $startNodePath]
        puts "data: [$documentTree get $base data]"
        puts "type(tag): [$documentTree get $base type]\n"
        # Start with the first child of the base tag.
        set li [walkf $documentTree $base {0}]
        while {$li ne ""} {
            set link [$documentTree get [walkf $documentTree $li {0}] data]
            # If there is no node at {0 0}, the error message ends up in $title.
            catch {
                $documentTree get [walkf $documentTree $li {0 0}] data
            } title
            puts "$link: $title"
            # Go from the current li to its sibling node.
            set li [$documentTree next $li]
        }
        $documentTree destroy
        return
    }
    # Follow the list of child indices in $path starting at $startNode.
    # Returns "" if the path leaves the tree.
    proc walkf {tree startNode path} {
        set node $startNode
        foreach idx $path {
            if {$node eq ""} {
                break
            }
            set node [lindex [$tree children $node] $idx]
        }
        return $node
    }
    # Same as walkf, but always starts at the root of the tree.
    proc walk {tree path} {
        return [walkf $tree root $path]
    }
}
::scraper::parse-list-of-links "http://wiki.tcl.tk/1683" {1 15 0}
dbohdan 2015-01-11: I found the example code above hard to understand, so I updated it with some comments as well as variable and proc names that I think clarify what the script does at each step. JM, I hope you don't mind my changes.
JM 2015-01-14: Of course not, this is much better, thanks!
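One way to approach the exercise mentioned earlier (collecting every link in a bullet rather than only the first) is to walk the whole subtree of each li and keep the nodes whose type is a. The following is a minimal sketch rather than a finished scraper: it parses a small inline HTML snippet instead of fetching the wiki page, and anchors-below is a hypothetical helper name, not part of htmlparse.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: collect the data of every <a> node at or below
# $startNode, in pre-order (document order).
proc anchors-below {tree startNode} {
    set result {}
    $tree walk $startNode -order pre node {
        if {[$tree get $node type] eq "a"} {
            lappend result [$tree get $node data]
        }
    }
    return $result
}

# Made-up sample input: one bullet with two links, one with none.
set html {<ul>
<li><a href="/one">One</a> and <a href="/two">Two</a></li>
<li>No link here</li>
</ul>}

set tree [::struct::tree]
htmlparse::2tree $html $tree

# First gather the <li> nodes, then look below each of them.
set items {}
$tree walk root -order pre node {
    if {[$tree get $node type] eq "li"} {
        lappend items $node
    }
}
set itemLinks {}
foreach li $items {
    lappend itemLinks [anchors-below $tree $li]
}
foreach links $itemLinks {
    puts "[llength $links] link(s): [join $links {, }]"
}
$tree destroy
```

Because a bullet without links simply yields an empty list, this approach also avoids the "node does not exist" error for items such as "simplicite".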
TreeQL
dbohdan 2015-01-11: The following script scrapes the same data as the one above but processes multiple links in each list item, not just the first one. This is done using TreeQL queries, with which manipulating every child node of a given node comes naturally.
package require struct
package require fileutil
package require htmlparse
package require http
package require treeql 1.3
proc parse-treeql {url} {
    set documentTree [::struct::tree]
    set conn [::http::geturl $url]
    set html [::http::data $conn]
    ::http::cleanup $conn
    htmlparse::2tree $html $documentTree
    treeql q1 -tree $documentTree
    treeql q2 -tree $documentTree
    q1 query tree withatt type ul
    set ul [lindex [q1 result] 2]
    q1 query replace $ul children children map x {
        # For each li in the ul...
        q2 query replace $x get data
        set link [lindex [q2 result] 0]
        q2 query replace $x children get data
        set title [lindex [q2 result] 0]
        if {$title ne ""} {
            puts "$link: $title"
        }
    }
    q1 discard
    q2 discard
    $documentTree destroy
    return
}
parse-treeql "http://wiki.tcl.tk/1683"
Selectors
With treeselect you can use CSS selector-like queries to access the elements of an HTML document stored in a tree object.
To run this example you will need a copy of the treeselect module in the same directory. You can download it with wiki-reaper:
wiki-reaper 41023 0 10 > treeselect-0.3.2.tm
::tcl::tm::path add .
package require treeselect 0.3
set tree [::treeselect::url-to-tree "http://wiki.tcl.tk/1683"]
set anchorNodes [::treeselect::query $tree {
hmstart html body .container #wrapper div#content
p:nth-child(10) ul li a
}]
foreach node $anchorNodes {
    set link [$tree get $node data]
    set title [$tree get \
        [::treeselect::query $tree "PCDATA" $node] data]
    puts "$link: $title"
}
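In all three examples the value printed as the link is the data attribute of the anchor node. As far as I can tell, htmlparse stores the raw attribute text of the tag there (something like href="..."), not the bare URL. If only the URL is wanted, a small helper along these lines might do. The proc name href-of is hypothetical, and the regular expressions only cover simple quoted href values.

```tcl
# Hypothetical helper: pull the href value out of a raw attribute string
# such as {class="x" href="/page"}. Returns "" when no quoted href is found.
proc href-of {data} {
    # Double-quoted value.
    if {[regexp -nocase {href\s*=\s*"([^"]*)"} $data -> url]} {
        return $url
    }
    # Single-quoted value.
    if {[regexp -nocase {href\s*=\s*'([^']*)'} $data -> url]} {
        return $url
    }
    return ""
}

set doubleQuoted [href-of {class="x" href="/example"}]
set singleQuoted [href-of {href='/other'}]
set none [href-of {name="anchor"}]
puts "$doubleQuoted $singleQuoted '$none'"
```

With such a helper, a line like set link [href-of [$tree get $node data]] would print the URL alone, under the assumption above about what data contains.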
Related links:
http://core.tcl-lang.org/tcllib/doc/trunk/embedded/www/tcllib/files/modules/struct/struct_tree.html