- Read the data into one or more lists
- Extract the data for a much shorter period of time
- Determine the minimum and maximum values
- Make a quick and dirty plot
- Write some data in a suitable format to another file
- ...
- I want to plot the data for station A and station B in the same plot. This could be done by code like:
set plot [create-a-plot]
$plot add data_a
$plot add data_b
- I want to determine daily averages of the temperature:
set averages [average [group $temp 24]]
- I want to compute the total amount of nitrogen (present in the chemical forms "Kjeldahl-N" and "nitrate" in the water body):
set totn [sum $kjdn $nitrate]
- A more ambitious form: all the data are stored in the same table, which contains ammonia, Kjeldahl-N, nitrate, ortho-phosphate and total-P (some typical water quality parameters):
set total_nutrients [construct $table totn {$kjdn+$nitrate} totp {$totp}]
- Give a statistical summary of the data in the table:
describe $table
- Select a certain range of data (retain every 10th record):
set new_table [slice $table 0 1000 10]

Ambitious? Perhaps.

The main problem is not the structure in which to keep the data, the main problem is the collection of commands! I can think of at least four different underlying structures:
- As a list or a list of lists - simple and straightforward
- As a matrix (from the Tcllib matrix module) - provides a flexible API for manipulating rows and columns
- As binary arrays - compact, easy to use by a compiled extension (vkit actually uses this approach)
- As tables in a database - Metakit could serve very well here
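To make the first option (plain lists) concrete, here is a minimal sketch of the [group] and [average] commands wished for in the daily-averages example. Both procs are hypothetical, not part of any existing package, and the temperature values are made up for illustration.

```tcl
# group --
#    Split a flat list into consecutive sublists of a given size
#    (hypothetical helper; the last group may be shorter)
proc group {values size} {
    if { $size <= 0 } {
        return -code error "Group size must be positive"
    }
    set groups {}
    set n [llength $values]
    for {set i 0} {$i < $n} {incr i $size} {
        lappend groups [lrange $values $i [expr {$i + $size - 1}]]
    }
    return $groups
}

# average --
#    Return the arithmetic mean of each group (hypothetical helper)
proc average {groups} {
    set means {}
    foreach g $groups {
        set total 0.0
        foreach v $g {
            set total [expr {$total + $v}]
        }
        lappend means [expr {$total / [llength $g]}]
    }
    return $means
}

# Hourly temperatures for two days (made-up values)
set temp {}
for {set hour 0} {$hour < 48} {incr hour} {
    lappend temp [expr {$hour < 24 ? 10.0 : 12.0}]
}

set averages [average [group $temp 24]]
puts "Daily averages: $averages"
```

With plain lists the commands stay short, but nothing records which column or unit a value belongs to; that bookkeeping is what a dataset structure with named columns would add.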
- [dataset {list of names}] creates an empty table with columns that can be addressed by name
- [sum $data1 $data2] returns a new dataset whose columns consist of the sums of the two datasets (the second may also be a plain list). The number of columns and rows must match
- [filter $data1 {condition}] returns a new dataset of which the rows have been filtered through the condition
- [slice $data1 $start $stop ?step?] returns a new dataset which has only the rows that are requested
- [contents $data] returns summary information about the dataset (printable)
- [setnames data {list of names}] sets the names of the columns of a dataset (the dataset is passed by name)
- [getnames $data] returns the names of the columns of a dataset
- [getrow $data $row] returns a list with the values at the given row
- [addrow data $values] adds a new row of data at the end (the dataset is passed by name)
- [print $data] prints the contents of the dataset to screen
}
# Create a proper namespace for the data manipulations and
# declare the public commands
#
namespace eval ::dml {
    namespace export dataset sum filter slice contents setnames
    namespace export getnames getrow addrow print

    # Private namespace - for filter and others
    namespace eval v {}
}

# dataset --
#    Create a new empty data set
# Arguments:
#    names      List of names
# Result:
#    A new dataset
#
proc ::dml::dataset {names} {
    return [list $names {}]
}

# contents --
#    Return a readable description of the dataset
# Arguments:
#    dataset    The dataset in question
# Result:
#    String describing the dataset
#
proc ::dml::contents {dataset} {
    set string "Columns in the dataset: [join [getnames $dataset] ", "]\n\
Number of rows: [llength [lindex $dataset 1]]"
    return $string
}

# getnames --
#    Get the column names of an existing dataset
# Arguments:
#    dataset    The dataset to be examined
# Result:
#    List of column names
#
proc ::dml::getnames {dataset} {
    return [lindex $dataset 0]
}

# setnames --
#    Set the column names of an existing dataset
# Arguments:
#    dataset    Name of the dataset
#    newnames   The new names for the columns - number must match
# Result:
#    None
#
proc ::dml::setnames {dataset newnames} {
    upvar $dataset theset
    set names [lindex $theset 0]
    if { [llength $names] != [llength $newnames] } {
        return -code error "Number of names does not match the number of columns"
    }
    lset theset 0 $newnames
}

# addrow --
#    Add a row of data to an existing dataset
# Arguments:
#    dataset    Name of the dataset
#    values     The row to be added
# Result:
#    None
# Note:
#    The number of values must match the number of columns
#
proc ::dml::addrow {dataset values} {
    upvar $dataset theset
    set names [lindex $theset 0]
    if { [llength $names] != [llength $values] } {
        return -code error "Number of values does not match the number of columns"
    }
    set data [lindex $theset 1]
    lappend data $values
    lset theset 1 $data
}

# getrow --
#    Get a row of data from an existing dataset
# Arguments:
#    dataset    The dataset to be examined
#    row        The index of the row to be returned
# Result:
#    List of values in that row (empty if the row does not exist)
#
proc ::dml::getrow {dataset row} {
    return [lindex [lindex $dataset 1] $row]
}

# slice --
#    Select the rows of a new dataset by stepping through an existing
# Arguments:
#    dataset    The dataset to be sliced
#    start      The first row to be added
#    stop       The last row to be added
#    step       The step size for stepping through the rows (optional)
# Result:
#    New dataset
#
proc ::dml::slice {dataset start stop {step 1}} {
    set names [lindex $dataset 0]
    set data  [lindex $dataset 1]
    if { $step <= 0 } {
        return -code error "Step size must be positive"
    }
    set newset [dataset $names]
    set row $start
    while { $row <= $stop } {
        addrow newset [getrow $dataset $row]
        incr row $step
    }
    return $newset
}

# sum --
#    Sum two datasets column by column
# Arguments:
#    data1      The first dataset
#    data2      The second dataset or a plain list
# Result:
#    New dataset
#
proc ::dml::sum {data1 data2} {
    #
    # Determine the number of columns first
    # Tricky, though
    #
    set names1  [lindex $data1 0]
    set values1 [lindex $data1 1]
    set norows1 [llength $values1]
    if { [llength $data2] != 2 } {
        set norows2 1
        set names2  $data2
        set values2 $data2
    } else {
        #
        # Note: this logic is _not_ complete!
        # It fails for a proper dataset with 1 column and 1 row
        #
        if { [llength [lindex $data2 1]] == 1 } {
            set norows2 1
            set names2  $data2
            set values2 $data2
        } else {
            set names2  [lindex $data2 0]
            set values2 [lindex $data2 1]
            set norows2 [llength $values2]
        }
    }
    if { $norows1 != $norows2 && $norows2 != 1 } {
        return -code error "Numbers of rows do not match"
    }
    if { [llength $names1] != [llength $names2] } {
        return -code error "Numbers of columns do not match"
    }

    set newset [dataset $names1]
    if { $norows2 > 1 } {
        foreach row1 $values1 row2 $values2 {
            set newrow {}
            foreach c1 $row1 c2 $row2 {
                lappend newrow [expr {$c1+$c2}]
            }
            addrow newset $newrow
        }
    } else {
        foreach row1 $values1 {
            set newrow {}
            foreach c1 $row1 c2 $values2 {
                lappend newrow [expr {$c1+$c2}]
            }
            addrow newset $newrow
        }
    }
    return $newset
}

# filter --
#    Filter the rows of a dataset
# Arguments:
#    dataset    The dataset to be filtered
#    condition  The condition for keeping a row
# Result:
#    New dataset
# Note:
#    To avoid possible conflicts between local
#    variables and the column names, we use a
#    private namespace for the local variables
#
proc ::dml::filter {dataset condition} {
    set v::names  [lindex $dataset 0]
    set v::newset [dataset $v::names]
    set v::values [lindex $dataset 1]
    set v::cond   $condition
    foreach v::row $v::values {
        # Assign the column values to local variables named after the columns
        foreach $v::names $v::row {break}
        if $v::cond {
            addrow v::newset $v::row
        }
    }
    return $v::newset
}

# print --
#    Print the contents of a dataset to stdout
# Arguments:
#    dataset    The dataset to be printed
# Result:
#    None
# Side effect:
#    Contents shown on screen
#
proc ::dml::print {dataset} {
    set row 0
    puts "Columns: [join [getnames $dataset] \t]"
    while {1} {
        set values [getrow $dataset $row]
        if { [llength $values] == 0 } {
            break
        }
        puts "$row: [join $values \t]"
        incr row
    }
}

if {0} {
Let us test this code:
}

namespace import ::dml::*

set table [dataset {A B}]
#
# Hm, candidate for a new command?
#
for {set i 0} {$i < 10} {incr i} {
    addrow table [list [expr {$i+1}] [expr {2*$i}]]
}
puts "Contents: [contents $table]"
puts "Columns: [getnames $table]"

puts "Records with A > 2"
set newtable [filter $table {$A>2}]
print $newtable

puts "Records with A > 2 and B < 10"
set newtable [filter $table {$A>2 && $B<10}]
print $newtable

puts "Summed table:"
print [sum $table $table]

puts "Slice of a table:"
print [slice $table 2 4]
if {0} {A few remarks about the above code are appropriate:
- There is far less error checking than needed in a truly useful package, especially if it is to be used in a semi-interactive way.
- The above code ignores the possibility of missing values, most naturally represented as an empty string or list ("" or {}).
- The command [filter] does not deal with parameters (that is, a condition like {$A>$threshold}, where "threshold" is a local variable). One naive way of dealing with such conditions would be to have the user use a global or namespace variable instead, like: {$A>$::threshold}.
- With the current set of commands you cannot manipulate columns.
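Two of these remarks can be illustrated with a short sketch. It assumes the same representation as above (a two-element list of column names and rows) but is written as standalone procs; the names filter_in_caller and getcolumn are hypothetical and not part of the code on this page. The first proc evaluates the condition in the caller's scope, so that local variables are visible; the price is that the column names are set as variables in the caller's frame and may overwrite variables of the same name. The second extracts a single column by name.

```tcl
# filter_in_caller --
#    Hypothetical variant of filter: evaluate the condition in the
#    caller's scope, so that local variables like $threshold work.
#    Caveat: column names become variables in the caller's frame.
proc filter_in_caller {dataset condition} {
    set names [lindex $dataset 0]
    set kept  {}
    foreach row [lindex $dataset 1] {
        # Make the column values available under their column names
        foreach name $names value $row {
            uplevel 1 [list set $name $value]
        }
        if { [uplevel 1 [list expr $condition]] } {
            lappend kept $row
        }
    }
    return [list $names $kept]
}

# getcolumn --
#    Hypothetical helper: return the values of one column as a list
proc getcolumn {dataset name} {
    set index [lsearch -exact [lindex $dataset 0] $name]
    if { $index < 0 } {
        return -code error "Unknown column: $name"
    }
    set column {}
    foreach row [lindex $dataset 1] {
        lappend column [lindex $row $index]
    }
    return $column
}

# A small dataset with made-up values
set table [list {A B} {{1 0} {2 2} {3 4} {4 6}}]

proc demo {table} {
    set threshold 2   ;# local variable used in the condition
    return [filter_in_caller $table {$A > $threshold}]
}

puts "Rows with A > threshold: [lindex [demo $table] 1]"
puts "Column B: [getcolumn $table B]"
```

Whether overwriting variables in the caller is acceptable depends on how interactive the use is; the private-namespace approach of [filter] avoids that clash but cannot see the caller's locals.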