Updated 2013-01-20 18:36:08 by pooryorick

A (and probably the biggest) internet search engine, reachable e.g. at http://google.com/ .

See Fuzzy Google truth for an oracle that extracts how many pages Google found for a query.

NEM 23nov2002 - Can someone explain the impact of the following snippet from the Google terms of service[1] on such Tcl-powered use of Google?

"No Automated Querying

You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that 'sending automated queries' includes, among other things:

  • using any software which sends queries to Google to determine how a website or webpage 'ranks' on Google for various queries;
  • 'meta-searching' Google; and
  • performing 'offline' searches on Google."

What part do you wish explained? What the words mean? Why they would do something like this? Why people ignore it?

NEM Gosh, I hadn't seen this reply (it's now August 2004). I was really wondering whether the term "automated query" included some of the stuff which was on this page, and to make sure that people were aware of the terms of service. I assume the official Google API is the preferred way to go about things these days.

LV I think so too - if they allowed people to do any kind of query in any sort of way, then they would face the 'wrath of Khan' when they changed the way the search engine worked. By insisting, from early on, that people use published APIs, they have a buffer that allows them to make back-end changes.

This limitation also serves as advance warning, a precursor to contact from a lawyer, for anyone making use of Google in a competitive manner.

While testing my browser package [2], I discovered that Google rejects any requests from the Tcl HTTP package, unless you alter the User-Agent string.

PT 13-May-2004: The following script will set the http package useragent string to something useful everywhere. I think the http package should probably use this automatically to avoid causing people unnecessary problems.
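package require http

# Set a browser-like User-Agent string for the http package.
# app: optional application token, appended in place of the Tcl version.
proc SetUseragent {{app {}}} {
    global tcl_platform
    # Platform and OS come from tcl_platform; the http version from the loaded package.
    set ua "Mozilla/4.0 ([string totitle $tcl_platform(platform)];\
        $tcl_platform(os)) http/[package provide http]"
    if {[string length $app] > 0} {
        append ua " " $app
    } else {
        append ua " Tcl/[package provide Tcl]"
    }
    http::config -useragent $ua
}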

Produces something like: Mozilla/4.0 (Windows; Windows NT) http/2.4.5 Tcl/8.4

NEM 17-June-2005: I'm not sure this is a good idea. If every HTTP request has the exact same user-agent string (or similar), then it makes it rather pointless. Sites which serve different content based on user-agent are broken, and I'm not sure we should bake-in a work-around into the HTTP package. I think the current setup is correct, identifying the Tcl http package directly. Individual apps or users can easily apply this hack.
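As a minimal sketch of the per-application approach NEM describes, an individual app can set its own User-Agent with http::config and leave the package default alone for everyone else. The application name and version below are hypothetical placeholders, and whether any particular site accepts the resulting string is not something this sketch can promise.

```tcl
package require http

# Hypothetical application name and version; substitute your own.
set appName    "MyApp"
set appVersion "1.0"

# Identify the application first, then the Tcl http package behind it.
http::config -useragent \
    "$appName/$appVersion http/[package provide http] Tcl/[package provide Tcl]"

# Reading the option back returns the string that later requests will send.
puts [http::config -useragent]
```

This keeps the choice in the application, so the string still says something about who is making the request.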

For examples of use of the "Google Web API" with Tcl, see [3]. LES: Forget this link. The Intel site is so incompetent that it won't let you in with Mozilla Firefox without nagging you about "upgrading" (ahem) to one of the browsers they choose to like. I keep Internet Explorer around for that kind of silly problem, and the content mentioned above doesn't seem to be there anymore.

Active discussion of the API is available through google.* netnews [4], as well as [5].

[Anyone want to add info here about code to search google's comp.lang.tcl* newsgroups?]

CL daily uses something like
set keywords "DDE+Word"
set URL http://www.deja.com/\[ST_rn=ps]/qs.xp?ST=PS&svcclass=dnyr&firstsearch=yes&preserve=1&QRY=$keywords&defaultOp=AND&DBS=1&OP=dnquery.xp&LNG=english&subjects=&groups=comp.lang.tcl&authors=&fromdate=&todate=&showsort=score&maxhits=25
...

LV notes that references to deja.com are, in 2010 at least, being redirected to google groups (I believe that many years ago Google absorbed deja.com).

[Anyone want to add info here about code to convert usenet message-id strings into google URLs?]

Use http://groups.google.com/groups?as_umsgid=$message_id (replacing $message_id with the usenet message-id of the post in question).

Or use http://groups.google.com/groups?selm=$message_id (select message).

Queries based on message-id are most useful for providing a compact url to refer to a particular message that you may have found in a search. The easiest (?) way to get that url is to click on "Original Format" for the particular message. Then copy and paste the url of that page, omitting the "&output=gplain" portion.
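The two URL forms above can be built with a small helper. This is only a sketch: the proc name and its arguments are made up for illustration, the as_umsgid and selm parameter names are taken from the text above, and the URLs themselves may no longer be honoured by Google Groups.

```tcl
package require http

# Hypothetical helper: build a Google Groups URL for a usenet message-id.
# param is the query parameter to use, either "selm" or "as_umsgid".
proc GroupsUrlFromMsgid {msgid {param selm}} {
    # Message-ids are often written as <id@host>; drop brackets and whitespace.
    set msgid [string trim $msgid "<> \t\n"]
    # http::formatQuery URL-encodes the value so characters such as "@"
    # or "&" cannot break the query string.
    return "http://groups.google.com/groups?[http::formatQuery $param $msgid]"
}

puts [GroupsUrlFromMsgid "<message-foo@bar.com>"]
puts [GroupsUrlFromMsgid "message-foo@bar.com" as_umsgid]
```

The exact percent-encoding of the message-id depends on the version of the http package in use, so the printed URLs may differ slightly between Tcl releases.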

Bob Techentin, moreover, writes "I usually start at the advanced group search (http://groups.google.com/advanced_group_search), enter some terms and search. To get the thread URL, click on the 'View Thread' link and you'll get a frame on the left with the thread view, and a lot of messages on the right. Click the 'No Frame' link, and you'll get a URL with the th=xxx that you can hand edit down to a minimal link."

If you check http://www.google.com/apis/api_faq.html , you will see that there is a WSDL definition for Google, allowing an application using SOAP to access the host. Tcl examples appear in various places, including the intel.com publication above. Note that relying on the Web API eliminates the need for web scraping.

Pat Thoyts, on the chat, writes: Someone has done a tclsoap wrapper for the google API. It is here http://gondolin.hist.liv.ac.uk/~cheshire/tclgoogle.html .

And here is another at Googling with SOAP.

05Aug03 - The Tclers Wiki is gone from Google - but it is back again now

Google is big. Very big. It has lots of parts. It has enthusiasts who like to watch and comment on the parts, as, for example, in "Google's Cheat Sheet Reveals A Couple Hidden Syntaxes" [6].

LV June 16, 2005 - Over the past week or two I've noticed an increasing number of groups.google.com links that no longer display discussions. Has some fundamental change - perhaps related to the beta google groups2 - taken place?

LV Interesting - most of the URLs I've noticed share one argument - th= . I did get an email from a Google employee who recommended URLs that use the message-id as a search argument. The best URL to use for groups is probably the message-id search, i.e.:
 http://groups.google.com/groups?selm=message-foo@bar.com

Google has a code search service. Check [7] for an example of using it to search the known Tcl code that uses a package require...

Google is now promoting a language called Go - see articles such as [8] for background concerning features that some developers consider unique.

Article on GoogleCL command line tool [9]