Is it possible to use the HTTP package to (I guess) send a URL to a site, which can pass some parameters to a CGI script at the site, and then capture what the CGI script returns, without going through a web browser? Where might I find sample code to help me write this?

Michael A. Cleverly responded: If I understand your question correctly, you are interested in a Tcl script which talks to a foreign web server (dynamically passing variables to a CGI script on that server), captures and parses the output, and does something with it, all without the use of a web browser. (CL notes that we sometimes call this "Web scraping".)

Basically the steps are:
- Figure out what the location is (url) and what inputs (form variables) you need to pass
- Use the http package to fetch the page
- Parse the results and glean whatever data you are after
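
A minimal sketch of these three steps, kept generic on purpose. The URL (http://www.example.com/cgi-bin/lookup), the form variable called name, and the choice of the page title as the data to glean are all made up for illustration; fetchPage and extractTitle are hypothetical helper names, not part of the http package.

```tcl
package require http

# Step 1: the location and the form variables (both hypothetical here).
set url http://www.example.com/cgi-bin/lookup
set query [::http::formatQuery name value]

# Step 2: fetch the page. Supplying -query makes geturl issue a POST.
proc fetchPage {url query} {
    set token [::http::geturl $url -query $query]
    set body [::http::data $token]
    # free the state the http package keeps for this request
    ::http::cleanup $token
    return $body
}

# Step 3: parse the result. As a stand-in for "whatever data you are
# after", pull out the page title; returns "" when there is none.
proc extractTitle {html} {
    if {[regexp -nocase {<title>([^<]*)</title>} $html -> title]} {
        return [string trim $title]
    }
    return ""
}
```

With a real URL in place, something like puts [extractTitle [fetchPage $url $query]] would run all three steps in sequence. Keeping the parsing in its own proc means it can be tried on a saved copy of the page before any requests are made.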
As an example, take the USPS web site (www.usps.gov) and the task of building a list of 3-digit zipcode/state associations. We click on the link in their navbar for "Find Zip Codes." From there we click on the link for "City/State/ZIP Code Associations page." We get to a page where we can enter a zipcode. Let's enter a zipcode just to see how it works and what kind of data we'll get. Any zipcode will do. I live in 84041, so I put that in.

Now we're at http://www.usps.gov/cgi-bin/zip4/ctystzip2 . There are no form variables in the URL itself (?zipcode=blah,blah,blah type stuff) so they must be using a POST method. The results we get are formatted in a fixed-width font and look like:

                                          For this ZIP Code,      ZIP Code
    City Name                     State   the city name is:       Type
    ----------------------------------------------------------------------
    LAYTON                        UT      ACCEPTABLE (DEFAULT)    STANDARD
    WEST LAYTON                   UT      NOT ACCEPTABLE-         STANDARD
                                          USE LAYTON

Looking at the HTML source in our browser we see:

    <FORM METHOD="POST" ACTION="/cgi-bin/zip4/ctystzip2">
    <INPUT SIZE="35" MAXLENGTH="35" NAME="ctystzip" value="84041">
    <INPUT TYPE="submit" VALUE="Process">

So now we know that we need to POST to http://www.usps.gov/cgi-bin/zip4/ctystzip2
and pass in a form variable of ctystzip with a five-digit zipcode. We'll get an HTML page back, and for a valid zipcode there will be one or more city names, a bunch of spaces, a two-letter state abbreviation, and then the words "ACCEPTABLE (DEFAULT)".

Since we don't care about the handful of exceptions where a zipcode crosses a state border in the middle of nowhere, we just need to query their web server repeatedly and build up our list of 3-digit zipcode/state associations. If we start at XXX00 and increment by one, we can stop either when we've found a valid zipcode (and hence the state) or when we reach XXX99 (meaning we've found a 3-digit zipcode prefix that hasn't been assigned yet). This way, though we'll still have to make a bunch of repeated requests, we won't have to make anywhere near the full 100,000 hits.

Once we parse out the data we could save it to a file, write it to standard out, stuff it in a database, or use the standard Tcl library to email someone about it. The possibilities are wide open.

So, now, here's the example code:

    #!/usr/local/bin/tclsh

    package require http

    # some websites, not the usps necessarily, care what kind of browser is used.
    ::http::config -useragent "Mozilla/4.75 (X11; U; Linux 2.2.17; i586; Nav)"

    set url http://www.usps.gov/cgi-bin/zip4/ctystzip2

    # We'll work down from 999xx to 000xx since it's more gratifying to
    # get results immediately. 999 is Alaska, while 000 and 001 aren't
    # assigned. :-)
    for {set i 999} {$i >= 0} {incr i -1} {
        for {set j 0} {$j <= 99} {incr j} {
            # use format to pad our string appropriately with leading zeros
            # to come up with a 5-digit zipcode to test
            set zipcode [format %03d $i][format %02d $j]

            # the http man page is a good place to read up on these commands
            set query [::http::formatQuery ctystzip $zipcode]
            set http [::http::geturl $url -query $query]
            set html [::http::data $http]
            # release the state the http package keeps for each request
            ::http::cleanup $http

            # we use a regular expression pattern to extract the text
            # we are looking for
            if {[regexp { ([A-Z][A-Z]) +ACCEPTABLE} $html => state]} {
                puts "[format %03d $i]xx ==> $state"
                # we found a match, so let's break out of the inner loop
                break
            } elseif {$j == 99} {
                puts "[format %03d $i]xx ==> not found"
            }
        }
    }

Running this produces output like:

    999xx ==> AK
    998xx ==> AK
    997xx ==> AK
    996xx ==> AK
    995xx ==> AK
    994xx ==> WA
    993xx ==> WA
    992xx ==> WA
    991xx ==> WA
    990xx ==> WA
    989xx ==> WA
    988xx ==> WA
    987xx ==> not found
    986xx ==> WA

etc.
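
The zero-padding and parsing steps of the script can be tried without contacting the server at all. The sketch below wraps them in two helper procs (zipFor and stateFromPage, names made up here) and feeds the same regular expression a sample modeled on the result page shown earlier; the exact column spacing in the sample is an assumption.

```tcl
# Build a 5-digit zipcode from a 3-digit prefix and a 2-digit suffix,
# padding both with leading zeros.
proc zipFor {prefix suffix} {
    return [format %03d%02d $prefix $suffix]
}

# Return the two-letter state found on a result page, or "" if the
# page has no "XX   ACCEPTABLE" entry.
proc stateFromPage {html} {
    if {[regexp { ([A-Z][A-Z]) +ACCEPTABLE} $html => state]} {
        return $state
    }
    return ""
}

# Sample text modeled on the result for 84041 (spacing assumed).
set sample [join {
    "LAYTON                        UT      ACCEPTABLE (DEFAULT)    STANDARD"
    "WEST LAYTON                   UT      NOT ACCEPTABLE-         STANDARD"
} \n]

puts [zipFor 7 5]                         ;# prints 00705
puts [stateFromPage $sample]              ;# prints UT
puts [stateFromPage "no such zipcode"]    ;# prints an empty line
```

Note that the "NOT ACCEPTABLE-" row does not match on its own, because the pattern requires the two capital letters to be followed directly by spaces and then ACCEPTABLE.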
See also: Finding distances by querying MapQuest

