
Purpose: to collect a variety of Tcl idioms that a programmer can use to work in much the same way as they would in awk.

Background:

awk [1] [2] is a tiny language created by Aho, Weinberger, and Kernighan. It resembles C in its syntax, but is much freer: it provides automatic storage allocation and associative arrays. Awk's most distinctive properties among scripting languages are its pattern-action control structure and its automatic parsing of its input. An awk program consists of one or more pattern-action pairs. Awk parses its input into records, tests each record against the patterns, and executes the associated action whenever a pattern matches. Each record is automatically split into fields. By default records are lines and fields are separated by whitespace, but both the record separator and the field separator can be set, either from the command line or from within the program. In more modern versions of awk the separators can be regular expressions.

Two special patterns, BEGIN and END, are used to allow actions to be executed before any input is read and after all input has been read.

As a result, many useful programs are very short. For example, this is the complete awk program to print the third field of each line followed by the first, separated by the output field separator (a space, by default):
 {print $3,$1}

It consists of an action with a null pattern, so the action is executed for every record.
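
For comparison, here is a rough Tcl sketch of the same thing, reading standard input line by line (remember that Tcl list indices are zero-based, so awk's $3 and $1 become indices 2 and 0):
 # Print the third and first whitespace-separated fields of each line.
 while {[gets stdin line] >= 0} {
     set fields [regexp -all -inline {\S+} $line]
     puts "[lindex $fields 2] [lindex $fields 0]"
 }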

This program prints the last field of records longer than 80 characters:
 length($0) > 80 {print $NF}

Here is a more complicated program:
 #! /bin/awk -f
 #       Purpose: to change a double spaced file into a single spaced file

 BEGIN { sw = 0 ; cnt = 0 }

 NF == 0 {
                cnt++
                if (sw == 1) {
                        print $0
                        sw = 0
                } else {
                        sw = 1
                }
                next
        }

 NF != 0 {
                cnt++
                print $0
                sw = 0
        }

 END { printf "Number of lines = %d\n", cnt }

The input file is named on the command line when the program above is invoked, and awk opens it automatically. awk uses a number of special variables:

  • FS is the character used as the field separator.
  • $0 is the current line. awk breaks $0 into fields $1, $2, $3, $4, ... $NF, using FS as the split character. The default FS is interesting: it is whitespace, and runs of two or more whitespace characters are collapsed and treated as a single separator. However, if you set FS to a specific character, then consecutive occurrences of that character create empty fields. Interestingly enough, in GNU awk and nawk you can give FS a regular expression, so that repeated separators no longer produce empty fields (a Tcl sketch of both behaviours follows this list).
  • NR is the record number (the first input line has NR == 1).
  • NF is the number of fields created after splitting the record on FS.
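
The two FS behaviours described above map naturally onto Tcl's regexp and split commands. A small sketch (the sample strings are invented for illustration):
 set line "foo   bar  baz"

 # Default awk FS: runs of whitespace act as a single separator.
 set fields [regexp -all -inline {\S+} $line]        ;# foo bar baz

 # A literal single-character FS: empty fields are preserved,
 # as with awk's FS=":".
 set fields [split "a::b:c" ":"]                      ;# a {} b c

 # A regular-expression FS that collapses repeated separators,
 # as with gawk's FS=":+".
 set fields [regexp -all -inline {[^:]+} "a::b:c"]    ;# a b c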

RS: Note that 0 by itself is not a variable, but a constant. "$0" means "the complete current line" and "$1" means "the first field of the current line", so the numbers are in effect indexing into an array named "$", with "$0" returning the whole record.

Tcl solutions:

Bob Techentin writes on news:comp.lang.tcl:

Since the awk NR is the record number, I assume that you're trying to get specific lines from a file by their line number. The "Tcl way" to do that, for small files that can be read entirely into memory, is to read the data in one fell swoop, then split it into a list, like this:
  set filename "myfile.dat"
  set fp [open $filename "r"]
  set data [split [read $fp [file size $filename]] "\n"]
  close $fp

Then the variable 'data' contains a list of lines. You can get at a specific line by using the list index command (note that lindex is zero-based, while awk's NR starts at 1):
  set p [lindex $data 122]   ;# element 122 is the 123rd line of the file
  puts "line 122:  '$p'"

If the original data file is very large, then you're stuck reading each line in a loop. If you plan to match lines, look at the regular expression man pages; Tcl 8.x's regular expressions are more powerful than traditional awk regular expressions.
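
A minimal sketch of that line-by-line loop, with an awk-style pattern/action pair (the file name and the pattern {^Error} are placeholders):
 set fp [open "myfile.dat" r]
 while {[gets $fp line] >= 0} {
     # pattern { action }, awk-style
     if {[regexp {^Error} $line]} {
         puts $line
     }
 }
 close $fp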

owh - a fileless tclsh (named in honor of Ousterhout, Welch, and Hobbs ;-) gives you a similar framework of operation (initial, per-line, and final code) which can be specified right on the command line, as is habitual for awk programmers. But the language is Tcl. Of course.

awksplit in Braintwisters mimics the line-splitting behavior of awk: given a string, it is broken up on a default or user-specified FS into variables 1..NF, where NF gives the number of fields. It even reconstructs $0 if you assign to one of the fields.
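
The page linked above has the real implementation; what follows is only a rough sketch of the idea under assumptions of my own (in particular, it does not reconstruct $0 when a field is assigned to):
 # Split str into awk-style variables 1..NF in the caller's scope.
 # An empty fs means "awk's default": split on runs of whitespace.
 proc awksplit {str {fs ""}} {
     upvar 1 NF NF
     if {$fs eq ""} {
         set fields [regexp -all -inline {\S+} $str]
     } else {
         set fields [split $str $fs]
     }
     set NF [llength $fields]
     set i 0
     foreach f $fields {
         uplevel 1 [list set [incr i] $f]
     }
 }

For example, after [awksplit "one two three"], NF is 3 and [set $NF] gives "three".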

Side note: while $1 in Tcl means "the value of the variable named 1" (which is a legal name), in awk it's rather a special syntax for indexing into array "$", so you can write (in awk):
 {for(i=1;i<=NF;i++){print $i}}
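
The idiomatic Tcl counterpart iterates over a list of fields rather than over field numbers. A sketch, assuming the current record is in the variable line:
 # Print each whitespace-separated field on its own line.
 foreach field [regexp -all -inline {\S+} $line] {
     puts $field
 }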

One of the neat aspects of this indexing is that one can refer to $NF - which means "use the value of the last field, whatever number it might be". Thus, one can write:
 #! /bin/awk -f
 { print $NF }

This splits each record on your FS and then prints its last field. One record can have 10 fields and the next 100 - you still get the last field! Not many other languages allow neat stuff like this.

Tcl of course does (cf. [lindex $list end]).
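
A sketch of the { print $NF } one-liner in Tcl, reading standard input:
 # Print the last field of every line, however many fields it has.
 while {[gets stdin line] >= 0} {
     puts [lindex [regexp -all -inline {\S+} $line] end]
 }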

Don't forget the file scanning commands [3] in TclX, i.e. scancontext, scanmatch, and scanfile.
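
A minimal sketch of how those commands fit together, assuming the TclX package is installed (the file name and pattern are placeholders):
 package require Tclx

 # A scan context collects pattern/action pairs, much like awk rules.
 set cx [scancontext create]
 scanmatch $cx {ERROR} {
     puts "$matchInfo(linenum): $matchInfo(line)"
 }

 # Scan the file: the action runs for every matching line.
 set fp [open "myfile.dat" r]
 scanfile $cx $fp
 close $fp
 scancontext delete $cx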


Arjen Markus has posted source [4] (also found at http://phaseit.net/claird/comp.lang.tcl/examples/awkproc.tcl ), for a Tcl-coded package which emulates important AWK capabilities.

AM (21 February 2006) Here is another one: Scan and modify text files - it is meant to solve a common problem: modifications that should be limited to certain parts of the text.

This [5] thread makes several points of interest to those coming from Awk.