Summary edit
Arjen Markus (18 february 2003) I wanted a good, concise script that shows what Tcl is good at. That is, a script that performs tasks that are difficult or awkward in system programming languages or require lots of code.The script below uses the following techniques:- Glob-style string matching
- Regular expressions
- Interaction with the file system
- Graphical user-interface in just a handful of lines of code
- Reading from files without having to worry about the length of strings etc.
- Look for files matching the given pattern
- Read each line and see if they match the textual pattern
- If so, display the line and the (first) matching part in the main window
Program 1 - agrep edit
# agrep.tcl -- # Script to emulate the UNIX grep command with a small GUI # # createWindow -- # Create the main window # # Arguments: # None # Result: # None # Side effects: # Controls added to main window # proc createWindow {} { global filemask global pattern global ignore_case # # Menubar (simple) # frame .menubar -relief raised -borderwidth 1 pack .menubar -side top -fill x menubutton .menubar.file -text File -menu .menubar.file.menu menu .menubar.file.menu -tearoff false .menubar.file.menu add command -label Exit -command exit pack .menubar.file -side left # # Fill in fields # frame .f1 label .f1.empty -text " " label .f1.mask_label -text "Files:" -justify left label .f1.patt_label -text "Regular expression:" -justify left entry .f1.filemask -textvariable filemask entry .f1.pattern -textvariable pattern checkbutton .f1.ignore_case -variable ignore_case -text "Ignore case" button .f1.search -command searchFiles -text "Search" grid .f1.empty x x grid .f1.mask_label .f1.filemask .f1.search -sticky w grid .f1.patt_label .f1.pattern .f1.ignore_case -sticky w pack .f1 -side top -fill x # # Result window # frame .f2 text .f2.text -font "Courier 10" \ -yscrollcommand {.f2.y set} \ -xscrollcommand {.f2.x set} scrollbar .f2.x -command {.f2.text xview} -orient horizontal scrollbar .f2.y -command {.f2.text yview} grid .f2.text .f2.y -sticky ns grid .f2.x x -sticky we pack .f2 -side top # # Just for the fun of it: define the styles for the "matched", # "error" and "fn" tags # .f2.text tag configure "matched" -underline 1 -background yellow .f2.text tag configure "fn" -underline 1 -background lightblue .f2.text tag configure "error" -background red } # searchFiles -- # Search for files in the current directory that match the given # mask # # Arguments: # None # Result: # None # Side effects: # Calls "searchPattern" to fill the result window # proc searchFiles {} { global filemask global pattern global ignore_case # # Clear the result window, then get a list of files # .f2.text delete 0.1 end if { $filemask == "" } { set filemask "*" } foreach file [glob $filemask] { if { [file isdirectory $file] } { continue ;# Might become a recursive descent later :) } else { searchPattern $file $pattern $ignore_case } } } # searchPattern -- # Search for lines containing the given pattern in a file # # Arguments: # filename Name of the file to be searched # pattern Given regular expression # ignore_case Ignore the case or not # Result: # None # Side effects: # Fills the result window # proc searchPattern {filename pattern ignore_case} { if { [ catch { set infile [open $filename "r"] .f2.text insert end "$filename:\n" fn while { [gets $infile line] >= 0 } { if { $ignore_case } { set match [regexp -nocase -indices -- $pattern $line indices] } else { set match [regexp -indices -- $pattern $line indices] } if { $match } { set first [lindex $indices 0] set last [lindex $indices 1] .f2.text insert end [string range $line 0 [expr {$first-1}]] .f2.text insert end [string range $line $first $last] "matched" .f2.text insert end [string range $line [expr {$last+1}] end] .f2.text insert end "\n" } } close $infile } msg ] } { .f2.text insert end "$msg\n" } .f2.text insert end "\n" } # main -- # Main code to get it all going # global filemask global pattern global ignore_case set filemask "*" set pattern {} set ignore_case 1 createWindow
Program 2 edit
TV I'm not sure it is superfluous, but bwise for a long time had this one packed with it:proc grep { {a} {fs {*}} } { set o {} foreach n [lsort -incr -dict [glob $fs]] { set f [open $n r] set c 0 set new 1 while {[eof $f] == 0} { set l [gets $f] incr c if {[string first $a $l] > -1} { if {$new == 1} {set new 0; append o "*** $n:" \n} append o "$c:$l" \n } } close $f } return $o }
# Test with: catch {console show} puts "Result:\n[grep "require" "*.tcl"]"
The variable which contains the return value is formatted to be OK in a shell or console, but can easily be computer formatted too. Efficiency was fine on older PC's. It errs on subdirs when they match the search pattern I just came up with, but it is small.
This application would be great if it could scan large files (> 20 Mb) at a reasonable speed. Any ideas to achieve that?AM I have never tried it on "large" files. What is the performance? Is it simply that there are a lot of hits (so that the display gets filled with a lot of lines)? Are your regular expressions complicated? (Hm, perhaps a small experiment ...)I ran it on a directory with a dozen files. One of them has 23 Mb and the other one 123 Mb. The others are below 10 Kb. The "grep-like utility" freezes. After 5 minutes, I give up and kill it. A search with pure grep takes less than 10 seconds. Then I removed the 123-Mb file. The search then took 55 seconds, still slow. Few matches and a very simple "letters-only" regex. Speed has never been Tcl's strongest suit. Or maybe this particular application's code is not efficient.AM Experiment:
- File of 11 MB, 164214 lines
- simple pattern "time"
- 2826 lines containing that word
- 30 seconds to read the file and display the result
- 14 seconds to read the file and not display the result
- 5 seconds to just read the file
schlenk:
- don't call .f2.text insert end $something so often
set text [list [string range $line 0 [expr {$first-1}]] {} \ [string range $line $first $last] "matched" \ "[string range $line [expr {$last+1}] end]\n" {} ] eval [linsert $text 0 .f2.text insert end]
- fconfigure the channel with a large buffer
- im not sure if blocks of lines instead of single lines with regexp -line helps speed up things:
regexp -all -line -indices ...
foreach {first last} $indices {break}
MS wrote something up, but did not get around to test it. Consider maybe as "idea" rather than "code".NOTE: if there are several matches in a single line, the original would highlight the first occurrence; this one highlights the last one instead. Should be corrected to highlight them all?AM I did the experiment: with the original script I get some 13 seconds and with Miguel's version I get the same -- no obvious improvement. But: this is with Tcl 8.3.
Program 3 edit
# ATTENTION: requires inserting the following line in # the calling proc [searchFiles], before the loop that # iterates over all files: # set pattern (.*)($pattern)(.*) # so that regexp will store the start-match-end # parts of each line. # The pattern above will highlight the last occurrence of # the requested pattern in a line; in order to get the # first, replace with # set pattern (.*?)($pattern)(.*)$ # This should be done in the # calling proc so that the compiled regexp need not be # recompiled for each file. # # - NOT TESTED - # # Improvements proposed here: # 1. make sure the input is buffered # 2. slurp in the file in larger chunks # 3. let [regexp] do the substring extraction # 4. coalesce text insertion (from above) proc searchPattern {filename pattern ignore_case} { set slurpSize 2000 if { [ catch { set infile [open $filename "r"] fconfigure $infile -buffering full .f2.text insert end "$filename:\n" fn set chunk {} while 1 { append chunk [read $infile $slurpSize] if {![string length $chunk]} { break } set lines [split $chunk "\n"] set chunk [lindex $lines end] set oldTail {} set res [list .f2.text insert end] foreach line [lrange $lines 0 end-1] { # # Use -inline instead of match variables, to insure # fastest access to these variables: [regexp] does not # need to access the variables at all, this proc accesses # them by index (no lookup). # # An alternative in Tcl8.4+, slightly slower though, # is to insure that the match vars are seen as local vars # by the compiler (I am actually not sure if this is really # needed, should dive into the compiler for [regexp]). To do # that, if you use the match vars {-> start match tail}, you # could insert at the top (outside the loop) # foreach {-> start match fail} {} break # In this case, [regexp] still looks up the variables by # name but this proc body accesses the variables by index. # (NOTE: the previous sentence was edited, it was wrong in its # first version) # if { $ignore_case } { set matches [regexp -inline -nocase -- $pattern $line] } else { set matches [regexp -inline -- $pattern $line] } if {[llength $matches]} { foreach {-> start match tail} $matches break lappend res $oldTail$start {} $match "matched" set oldTail $tail\n } } if {[llength $res]} { eval [lappend res $oldTail {}] } } close $infile } msg ] } { .f2.text insert end "$msg\n" } .f2.text insert end "\n" }
AM (9 may 2005) Another thought: could it be that as used above, [regexp] re-compiles the pattern with each invocation? Complicated patterns could then require a lot of time to achieve this. Then it would be more efficient to change the procedure's body so that a constant pattern is seen by [regexp] .... I have not tested this idea yet.
[Ross Cartlidge] (19 October 2005) egrep is always going to be faster than any tcl/perl/python. So use it as a filter and thus only process the matched lines. Use this version of searchPattern
Program 4 edit
proc searchPattern {filename pattern ignore_case} { if {[catch { close [open $filename] .f2.text insert end "$filename:\n" fn set matches {} catch { if {$ignore_case} { set matches [exec egrep -i $pattern $filename] } else { set matches [exec egrep $pattern $filename] } } foreach line [split $matches "\n"] { if {$ignore_case} { set match [regexp -nocase -indices -- $pattern $line indices] } else { set match [regexp -indices -- $pattern $line indices] } if { $match } { set first [lindex $indices 0] set last [lindex $indices 1] .f2.text insert end [string range $line 0 [expr {$first-1}]] .f2.text insert end [string range $line $first $last] "matched" .f2.text insert end [string range $line [expr {$last+1}] end] .f2.text insert end "\n" } } } msg ] } { .f2.text insert end "$msg\n" } .f2.text insert end "\n" }
Sarnold asks why not call string match or string first on each line before applying a regular expression?Here is an attempt to make it faster (regexps are constants as they are stored in a proc):
Program 5 edit
proc searchPattern {filename pattern ignore_case} { catch {rename matches ""} catch {rename globmatches ""} set pattern2 *${pattern}* set pattern (.*)(${pattern})(.*) } if {$ignore_case} { set body {regexp -inline -nocase --} } else { set body {regexp -inline --} } proc matches line [string map {%LINE% $line} [linsert $body end $pattern %LINE%]] if {$ignore_case} { set body {string match -nocase} } else { set body {string match} } proc globmatches line [string map {%LINE% $line} [linsert $body end $pattern2 %LINE%]] set slurpSize 2000 if { [ catch { set infile [open $filename "r"] fconfigure $infile -buffering full .f2.text insert end "$filename:\n" fn set chunk {} while {![eof $infile]} { append chunk [read $infile $slurpSize] if {![string length $chunk]} { break } set lines [split $chunk "\n"] set chunk [lindex $lines end] foreach line [lrange $lines 0 end-1] { # this test checks $line for a glob-style matching (via string match) # you might drop it if you find it limiting pattern matching if {[globmatches $line]} { if {[llength [set matches [matches $line]]]} { foreach {-> start matched tail} $matches break .f2.text insert end $start .f2.text insert end $matched "matched" .f2.text insert end $tail .f2.text insert end "\n" } } } } close $infile } msg ] } { .f2.text insert end "$msg\n" } .f2.text insert end "\n" }
Comments edit
...See also: