Is it possible to use the HTTP package to (I guess) send a URL to a site, which can pass some parameters to a cgi script at the site and then capture what the cgi script returns, without going through a web browser? Where might I find sample code to help me write this?

Michael A. Cleverly responded: If I understand your question correctly, you are interested in a Tcl script which talks to a foreign web server (dynamically passing variables to a cgi script on that server), captures and parses the output, and does something with it without the use of a web browser. (CL notes that we sometimes call this "Web scraping".)

Basically the steps are:
- Figure out what the location is (url) and what inputs (form variables) you need to pass
- Use the http package to fetch the page
- Parse the results and glean whatever data you are after
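The steps above can be sketched in a few lines of Tcl. This is only a minimal outline, assuming Tcl 8.6 (for try and the {*} expansion syntax); the helper names fetch_form and extract are made up for illustration and are not part of the http package, and the 10-second timeout is an arbitrary choice.

```tcl
package require http

# Step 2: POST form variables to a URL and return the response body.
# fetch_form is a hypothetical helper name, not an http package command.
# args is a flat list of name/value pairs, e.g. ctystzip 84041
proc fetch_form {url args} {
    set query [::http::formatQuery {*}$args]
    # -timeout is in milliseconds; 10000 is an assumed, arbitrary value
    set token [::http::geturl $url -query $query -timeout 10000]
    try {
        if {[::http::status $token] ne "ok"} {
            error "request failed: [::http::status $token]"
        }
        return [::http::data $token]
    } finally {
        # always release the token, even when an error is raised
        ::http::cleanup $token
    }
}

# Step 3: pull one value out of the page with a regular expression.
# Returns the first capture group, or an empty string if nothing matched.
proc extract {pattern html} {
    if {[regexp -- $pattern $html -> value]} {
        return $value
    }
    return ""
}
```

Step 1 (working out the URL and the form variables) still has to be done by hand, by reading the HTML source of the form you want to submit.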
For example, looking up ZIP code 84041 on the USPS site displays a result like this:

    For this ZIP Code,    ZIP Code City Name State
    the city name is:     Type
    ----------------------------------------------------------------------
    LAYTON         UT    ACCEPTABLE (DEFAULT)    STANDARD
    WEST LAYTON    UT    NOT ACCEPTABLE-         STANDARD
                         USE LAYTON

Looking at the HTML source in our browser we see:
    <FORM METHOD="POST" ACTION="/cgi-bin/zip4/ctystzip2">
    <INPUT SIZE="35" MAXLENGTH="35" NAME="ctystzip" value="84041">
    <INPUT TYPE="submit" VALUE="Process">

So now we know that we need to POST to http://www.usps.gov/cgi-bin/zip4/ctystzip2 and pass in a form variable of ctystzip with a five-digit zipcode. We'll get an HTML page back, and for a valid zipcode there will be one or more city names, a bunch of spaces, a two-letter state abbreviation, and then the words "ACCEPTABLE (DEFAULT)".

Since we don't care about the handful of exceptions where a zipcode crosses a state border in the middle of nowhere, we just need to query their web server repeatedly and build up our list of 3-digit zipcode/state associations. If we start at XXX00 and increment by one, we can stop either when we've found a valid zipcode (and hence the state) or when we reach XXX99 without a match (meaning we've found a 3-digit zipcode prefix that hasn't been assigned yet). This way, though we'll still have to make a bunch of repeated requests, we won't have to make anywhere near a full 100,000 hits.

Once we parse out the data we could save it to a file, write it to standard out, stuff it in a database, or use the standard Tcl library to email someone about it. The possibilities are wide open.

So, now, here's the example code:
    #!/usr/local/bin/tclsh

    package require http

    # some websites, not the usps necessarily, care what kind of browser is used.
    ::http::config -useragent "Mozilla/4.75 (X11; U; Linux 2.2.17; i586; Nav)"

    set url http://www.usps.gov/cgi-bin/zip4/ctystzip2

    # We'll work down from 999xx to 000xx since it's more gratifying to
    # get results immediately. 999 is Alaska, while 000 and 001 aren't
    # assigned. :-)
    for {set i 999} {$i >= 0} {incr i -1} {
        for {set j 0} {$j <= 99} {incr j} {
            # use format to pad our string appropriately with leading zeros
            # to come up with a 5-digit zipcode to test
            set zipcode [format %03d $i][format %02d $j]

            # the http man page is a good place to read up on these commands
            set query [::http::formatQuery ctystzip $zipcode]
            set http  [::http::geturl $url -query $query]
            set html  [::http::data $http]

            # free the request state; otherwise it accumulates over
            # thousands of requests
            ::http::cleanup $http

            # we use a regular expression pattern to extract the text
            # we are looking for
            if {[regexp { ([A-Z][A-Z]) +ACCEPTABLE} $html => state]} {
                puts "[format %03d $i]xx ==> $state"
                # we found a match, so let's break out of the inner loop
                break
            } elseif {$j == 99} {
                puts "[format %03d $i]xx ==> not found"
            }
        }
    }

Running this produces output like:
    999xx ==> AK
    998xx ==> AK
    997xx ==> AK
    996xx ==> AK
    995xx ==> AK
    994xx ==> WA
    993xx ==> WA
    992xx ==> WA
    991xx ==> WA
    990xx ==> WA
    989xx ==> WA
    988xx ==> WA
    987xx ==> not found
    986xx ==> WA

etc.
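As mentioned earlier, the parsed data could be saved to a file instead of being printed. Here is one possible sketch, assuming Tcl 8.6 and assuming the results have been collected in a dict that maps each 3-digit prefix to its state. The proc name save_prefixes and the tab-separated layout are choices made for illustration only.

```tcl
# save_prefixes is a hypothetical helper, not part of any package.
# prefixes is a dict mapping a 3-digit zipcode prefix to a state code.
proc save_prefixes {path prefixes} {
    set out [open $path w]
    try {
        # dicts keep insertion order, so lines are written in the
        # order the prefixes were discovered
        dict for {prefix state} $prefixes {
            puts $out "$prefix\t$state"
        }
    } finally {
        # close the file even if a write fails
        close $out
    }
}
```

For instance, save_prefixes zip3-states.txt [dict create 999 AK 994 WA] would write one tab-separated line per prefix to a file named zip3-states.txt (an assumed name) in the current directory. In the script above, the puts in the inner loop could be replaced by dict set calls that build up such a dict.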
See also: Finding distances by querying MapQuest