Usage
- First you have to get the code below into a file. You have to start somewhere ;-). Save this page into a file called "wiki-reaper" and edit the contents to remove everything except the code, or use the one-liners in the Bootstrapping section below.
- Make certain that the file will be found when you attempt to run it. On Unix-like systems, that involves putting the file into one of the directories in $PATH.
- wiki-reaper 4718 causes wiki-reaper to fetch itself... :-) (see the example invocations below)
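Some example invocations, as a sketch, assuming you have saved the script as wiki-reaper somewhere in $PATH; the option names and argument order come from the code below:

wiki-reaper 4718 > wiki-reaper-copy.tcl   # latest revision, all code blocks
wiki-reaper -x 4718 0 > block0.tcl        # code block 0 only, with a #! line first
wiki-reaper 4718 0 34                     # code block 0 as of revision 34
wiki-reaper -v                            # print the version and exit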
Previous Revisions
- edit 34, replaced 2014-12-30
- the original code by Michael A. Cleverly
Code
#!/usr/bin/env tclsh
package require Tcl 8.5
package require cmdline
package require http
package require textutil

namespace eval wiki-reaper {
    variable version {4.0.0 (2017-11-17)}
    variable protocol https
    variable useCurl 0
    variable hostname wiki.tcl-lang.org

    # Prefer TclTLS for HTTPS; fall back to cURL, then to plain HTTP.
    set curl [expr {![catch {exec curl --version}]}]
    set tls [expr {![catch {
        package require tls
        http::register https 443 [list ::tls::socket -tls1 1]
    }]}]
    if {!$tls} {
        if {$curl} {
            set useCurl 1
        } else {
            set protocol http
        }
    }
    unset curl tls
}

proc ::wiki-reaper::output args {
    set ch stdout
    switch -exact -- [llength $args] {
        1 { lassign $args data }
        2 { lassign $args ch data }
        default {
            error {wrong # args: should be "output ?channelId? string"}
        }
    }
    # Don't throw an error if $ch is closed halfway through.
    catch {
        puts $ch $data
    }
}

proc ::wiki-reaper::fetch url {
    variable useCurl
    # The cookie is necessary when you want to retrieve page history.
    set cookie wikit_e=wiki-reaper
    if {$useCurl} {
        set data [exec curl -s -f -b $cookie $url]
    } else {
        set connection [::http::geturl $url -headers [list Cookie $cookie]]
        set data [::http::data $connection]
        ::http::cleanup $connection
    }
    return $data
}

proc ::wiki-reaper::parse-history-page {html pattern} {
    # Extract the revision number and date from a page history table row.
    set needle [join [list \
        {<td class=["']Rev["']><a href=["'][^"']+["'] rel=["']nofollow["']>} \
        $pattern \
        {</a></td><td class=["']Date["']>([0-9\-\: ]+)</td>} \
    ] {}]
    set success [regexp -nocase $needle $html _ revision date]
    if {!$success} {
        error "couldn't parse the revision or the date."
    }
    return [list $revision $date]
}

proc ::wiki-reaper::reap {page block {revision "latest"} {flags ""}} {
    variable protocol
    variable hostname

    # URL templates; they are expanded with [subst] once the variables
    # they mention are known.
    set latestHistoryUrl "$protocol://${hostname}/_/history?N=\$page&S=0&L=1"
    set revisionHistoryUrl "$protocol://${hostname}/_/history?N=\$page&S=\[expr \
        {\$latestRevision - \$revision}\]&L=1"
    set pageUrl "$protocol://${hostname}/_/revision?N=\$page&V=\$revision"
    set codeUrl "$protocol://${hostname}/_/revision?N=\$page.code&V=\$revision"

    set now [clock format [clock seconds] -format "%Y-%m-%d %H:%M:%S" -gmt 1]

    if {$revision eq ""} {
        set revision "latest"
    }

    set latestHistory [fetch [subst $latestHistoryUrl]]
    if {![regexp -nocase \
            {<title>Change history of ([^<]*)</title>} \
            $latestHistory _ title]} {
        error "couldn't parse the document title."
    }
    lassign [parse-history-page $latestHistory {([0-9]+)}] \
        latestRevision latestUpdated

    if {$revision eq "latest"} {
        set revision $latestRevision
        set updated $latestUpdated
    } else {
        if {![string is integer -strict $revision] ||
                ($revision < 0) || ($revision > $latestRevision)} {
            error "no revision $revision ($latestRevision latest)"
        }
        set revHistory [fetch [subst $revisionHistoryUrl]]
        lassign [parse-history-page $revHistory "($revision)"] revision updated
    }

    set code [fetch [subst $codeUrl]]
    if {[regexp -nocase "<body>.?<h2>[subst $codeUrl] Not Found</h2>" \
            $code _]} {
        error "wiki page $page does not exist"
    }

    if {$block ne ""} {
        # The ".code" representation separates code blocks with marker
        # lines; list element zero is what precedes the first block.
        set codeBlocks [::textutil::splitx $code \
            {\n#+ <code_block id=[0-9]+ title='[^>]*?'> #*\n}]
        set code [lindex $codeBlocks [expr {$block + 1}]]
    }

    if {[dict get $flags "x"]} {
        output "#!/usr/bin/env tclsh"
    }
    output "#####"
    output "#"
    output "# \"$title\" ([subst $pageUrl])"
    if {$block ne ""} {
        output "# Code block $block"
    }
    output "#"
    output "# Wiki page revision $revision, updated: $updated GMT"
    output "# Tcl code harvested on: $now GMT"
    output "#"
    output "#####"
    output $code
    output "# EOF"
}

proc ::wiki-reaper::main {argv} {
    variable protocol
    variable version

    set options {
        {f "Allow downloading over HTTP instead of HTTPS"}
        {x "Output '#!/usr/bin/env tclsh' as the first line"}
        {v "Print version and exit"}
    }
    set usage "?options? page ?codeBlock? ?revision?"
    if {$argv in {-h -help --help -?}} {
        output stderr [::cmdline::usage $options $usage]
        exit 0
    }
    if {[catch {
        set flags [::cmdline::getoptions argv $options $usage]
    } err]} {
        output stderr $err
        exit 1
    }

    if {$protocol eq "http"} {
        if {[dict get $flags "f"]} {
            output stderr {Warning! Can't use cURL or TclTLS; connecting over\
                insecure HTTP.}
        } else {
            output stderr {Can't use cURL or TclTLS; refusing to connect over\
                insecure HTTP without the "-f" flag.}
            exit 1
        }
    }

    lassign $argv page block revision
    if {[dict get $flags "v"]} {
        output $version
        exit 0
    }
    if {$page eq ""} {
        output stderr [::cmdline::usage $options $usage]
        exit 0
    }
    reap $page $block $revision $flags
}

proc ::wiki-reaper::main-script? {} {
    # Check whether this script is being run directly rather than sourced.
    # From https://tcl.wiki/40097.
    global argv0

    if {[info exists argv0] &&
            [file exists [info script]] && [file exists $argv0]} {
        file stat $argv0 argv0Info
        file stat [info script] scriptInfo
        expr {$argv0Info(dev) == $scriptInfo(dev) &&
              $argv0Info(ino) == $scriptInfo(ino)}
    } else {
        return 0
    }
}

if {[::wiki-reaper::main-script?]} {
    ::wiki-reaper::main $argv
}
#EOF
Bootstrapping
You wouldn't want to copy-paste this to a file manually, now would you?
fr: Why not? lam suggested a tiny JavaScript that allows copy and paste - click on the code to select it. Please check out this page cloned at wiki-reaper.
On *nix
Here's a one-liner to save the wiki-reaper script to wiki-reaper.tcl in the current directory. It uses awk and markers in the code (the "env tclsh" shebang and the final #EOF) to make sure other code blocks on this wiki page don't interfere.

curl https://wiki.tcl-lang.org/4718.code | awk 'BEGIN{write=0} /env tclsh/{if(write==0){write=1}} /#EOF/{write=-1} {if(write==1){print $0}}' > wiki-reaper.tcl && chmod 0755 wiki-reaper.tcl

If you want to install wiki-reaper to /usr/local/bin on your machine instead, run

curl https://wiki.tcl-lang.org/4718.code | awk 'BEGIN{write=0} /env tclsh/{if(write==0){write=1}} /#EOF/{write=-1} {if(write==1){print $0}}' | sudo tee /usr/local/bin/wiki-reaper && sudo chmod +x /usr/local/bin/wiki-reaper

On Windows
TBD
Security
If it can find neither TclTLS nor cURL on your system, wiki-reaper falls back to downloading code from the wiki over plain HTTP (if you allow it with the -f flag). Plain HTTP is vulnerable to a [MITM] attack, meaning a hostile network node can replace the code you want with something malicious. Moreover, anyone can edit the wiki, so the code may change between when you look at it and when you download it. Be sure to inspect the code you fetch with wiki-reaper before you run it.
Discussion
jcw 2002-11-22: This could be the start of something more, maybe... I've been thinking about how to make the wiki work for securely re-usable snippets of script code. Right now, doing a copy-and-paste is tedious (the above solves that), but also risky: what if someone decides to play tricks and hide some nasty change in there? That prospect is enough to make it quite tricky to re-use any substantial pieces, other than after careful review - or simply as inspiration for re-writing things. Can we do better? Maybe we could. What if a "wiki snippet repository" were added to this site - here's a bit of thinking out loud:
- if verbatim text (text in <pre>...</pre> form) starts off with a certain marker, it gets recognized as being a "snippet"
- snippets are stored in a separate read-only area, and remain forever accessible, even if the page changes subsequently
- the main trick is that snippets get stored on basis of their MD5 sum
- each snippet also includes: the wiki page#, the IP of the submitter, timestamp, and a tag
- the tag is extracted from the special marker that introduces a snippet, it's a "name" for the snippet, to help deal with multiple snippets on a page
- if you have an MD5, you can retrieve a snippet, without risk of it being tampered with, by a URL, say http://mini.net/wikisnippet/<this-is-the-32-character-md5-in-hex> (see the sketch after this list)
- the IP stored with it is the IP of the person making the change, and creating the snippet in the first place, so it is a reliable indicator of the source of the snippet
- if you edit a page and don't touch snippet contents, nothing happens to them
- if you do alter one, it gets a new MD5 and other info, and gets stored as a new snippet
- if you delete one, it stops being on the page, but the old one is retrievable as before
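A minimal Tcl sketch of the content-addressed storage idea above, assuming the tcllib md5 package; the snippetdir argument and the stored record format are hypothetical:

package require md5

proc snippet-store {snippetdir text page ip tag} {
    # A snippet's identity is the MD5 sum of its contents.
    set sum [string tolower [::md5::md5 -hex -- $text]]
    set path [file join $snippetdir $sum]
    # Identical content is stored once; an existing snippet is never
    # rewritten, so old revisions stay retrievable forever.
    if {![file exists $path]} {
        set ch [open $path w]
        puts $ch [list $page $ip [clock seconds] $tag $text]
        close $ch
    }
    return "http://mini.net/wikisnippet/$sum"
}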
SB 2002-11-23: If you for a minute forget about validating code integrity and think about the possibility of modifying program code independently of its location, then it sounds like a very good idea. An example is to show the progress of coding: the start is a very simple example, then the example is slightly modified to show how the program can be improved. With this scheme, every improvement of the code can be backtracked to the very beginning and, hence, work as a tutorial for new programmers. If we then think about trust again, there are too many options for code fraud that I do not know about.
escargo 2002-11-23: I have to point out that the IP address of the source is subject to a bunch of qualifications. Leaving out the possibility of the IP address being spoofed, I get different IP addresses because of the different locations I use to connect to the wiki; with subnet masking it's entirely possible that my IP addresses could look very different at different times even when I am connected from the same system.
Aside from that issue, could such a scheme be made to work well with a version of the unknown proc and mounting the wiki, or part of the wiki, through VFS? This gets back to the TIP dealing with metadata for a repository.
This in turn leads me to wonder: how much of a change would it be to add a page template capability to the wiki? In practice now, when we create a new page, it is always the same kind of page. What if there were a policy change that allowed for creating each new page selected from a specific set of page types? The new snippet page would be one of those types. Each new page would have metadata associated with it. Instead of always editing pages in a text box, maybe there would be a generated form. Is that possible? How hard would it be? This could lead from a pure wiki to a web-based application, but I don't know if that is a bad thing or not. Just a thought. (Tidied up 5 May 2003 by escargo.)
LV 2003-05-05: With regard to the snippet ideas above, I wonder if, with the addition of CSS support here on the wiki, some sort of specialized marking would not only enable snipping code but would also enable some sort of special display as well - perhaps color coding to distinguish proc names from variables from data from comments, etc.
CJU 2004-03-07: In order to do that, you would need to add quite a bit of extra markup to the HTML. I once saw somewhere that one of the unwritten "rules" of wikit development was that preformatted text should always be rendered untouched from the original wiki source (with the exception of links for URLs). I don't particularly agree with it, but as long as it's there, I'm not inclined to believe that the developer(s) are willing to change it.
Now, straying away from your comment for a bit, I would rather have each preformatted text block contain a link to the plain text within that block. This reaping is an entertaining exercise, but it's really just a workaround for the fact that getting just the code out of an HTML page is inconvenient for some people. I came to this conclusion when I saw a person suggest that all reapable pages on the wiki should have hidden markup so that the reaper could recognize whether the page was reapable or not. To me, it's a big red flag when you're talking about manually editing hundreds or thousands of pages to get capability that should be more or less automatic.
I'm looking at toying around with wikit in the near future, so I'll add this to my list of planned hacks.
LV 2007-10-08: Well, I changed the mini.net reference to tcl.wiki. But there is a bug that results in punctuation being encoded. I don't know why that wasn't a problem before. But I changed one string map into a call to ::htmlparse::mapEscapes to take care of the problem.
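For reference, a minimal example of what ::htmlparse::mapEscapes from tcllib does; the input string here is made up:

package require htmlparse
# Decode HTML escape sequences back into plain characters.
puts [::htmlparse::mapEscapes {if {$a &lt; $b} {puts &quot;smaller&quot;}}]
# prints: if {$a < $b} {puts "smaller"}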
tb 2009-06-16: Hm... I still get unmapped escape sequences when reaping from this page using kbskit-8.6. I don't get them when reaping from a running wikit. Am I missing something?
LV 2009-06-17 07:37:08: Is anyone still using this program? Do any of the wiki's enhancements from the past year or two provide a way to make this type of program easier?
jdc 2009-06-17 08:34:23: Fetching <pagenumber>.txt will get you the wiki markup. Best start from there when you want to parse the wiki pages yourself. Another option is to fetch <pagenumber>.code to get only the code blocks. Or use TWiG.
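A minimal sketch of jdc's suggestion, assuming TclTLS is available for HTTPS and using the wiki's current hostname:

package require http
package require tls
::http::register https 443 [list ::tls::socket -tls1 1]

# Fetch just the code blocks of page 4718 as plain text.
set token [::http::geturl https://wiki.tcl-lang.org/4718.code]
puts [::http::data $token]
::http::cleanup $token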
dbohdan 2014-12-30: The code could not retrieve anything when I tried it, so I updated it to work with the wiki as it is today. It now uses <pagenumber>.code, which jdc mentioned above. Other changes:
- Do not use nstcl-http or exec magic.
- Include page URL in the output.
- Format date of retrieval consistently with dates on the wiki.
- Can retrieve a specific code block from the page if you tell it to. The usage is now wiki-reaper page ?codeBlock?.
- Can no longer fetch more than one page at a time due to the above.
See also
- wiki-runner
- TWiG
- Fetch <page URL>.txt to get the wiki markup instead of the HTML, or <page URL>.code for just the code blocks.