Updated 2017-12-18 08:43:11 by User0086

Slurp! -- reading a file

The Tcl commands for working with files are file, open, close, gets, read, puts, seek, tell, eof, fblocked, fconfigure, flush and fileevent. See also the Tcl_StandardChannels(3) and filename manual pages.

One way to get file data in Tcl is to 'slurp' up the file into a text variable. This works really well if the files are known to be small.
#  Slurp up the data file
set fp [open "somefile" r]
set file_data [read $fp]
close $fp

Now you can split file_data into lines, and process it to your heart's content. NOTE: The mention of split is important here: input data is seldom well-behaved or well-structured, and it needs to be processed this way to ensure that any potential Tcl metacharacters are appropriately quoted into list format.
#  Process data file
set data [split $file_data "\n"]
foreach line $data {
     # do some line processing here
}

To incrementally read a large amount of data from a channel using an arbitrary end-of-line delimiter, have a look at:

http://ynform.org/w/Pub/TclGetsd

[split $file_data] splits at every whitespace character (space, tab and newline alike), so the line structure of the original data is lost. To avoid this, split on a single character such as "\n", as in [split $file_data "\n"].

NEM split doesn't lose any data (except the newlines). All of the data in the file is still completely present in the list that split returns. In particular, [join [split $data \n] \n] will result in the same file contents.
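The round trip NEM describes can be checked with a short sketch. The sample string below is hypothetical and stands in for the contents of a file; it contains brackets, a run of spaces and a tab.

```tcl
# Hypothetical sample text standing in for slurped file contents
set data "a \[ or a \]\nextra   spaces\tand a tab\n"

# Splitting on "\n" removes only the newlines ...
set lines [split $data "\n"]

# ... so joining with "\n" restores the original string exactly
set roundtrip [join $lines "\n"]
puts [expr {$roundtrip eq $data}]   ;# prints 1

# Splitting on all whitespace and joining with spaces does not:
# the tab and the newlines come back as plain spaces
set lossy [join [split $data]]
puts [expr {$lossy eq $data}]       ;# prints 0
```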

In the following scenario, some whitespace is lost because split is used without specifying a character to split on. It therefore splits at every whitespace character, and join then puts a single space wherever a tab or newline used to be:
$ cat /tmp/testdata.txt
This is a test.
How in the world will this work?
If I have a [ or I have
a ] and other such things, will they remain?
And what about   extra spaces or even a tab?

$ tclsh8.6
% set fd [open "/tmp/testdata.txt" "r"]
file4
% set a [read $fd]
This is a test.
How in the world will this work?
If I have a [ or I have
a ] and other such things, will they remain?
And what about   extra spaces or even a tab?


% set b [split $a]
This is a test. How in the world will this work? If I have a {[} or I have a \] and other such things, will they remain? And what about {} {} extra spaces or even a tab? {} {}
% set c [join $b]
This is a test. How in the world will this work? If I have a [ or I have a ] and other such things, will they remain? And what about   extra spaces or even a tab?  
% puts $c
This is a test. How in the world will this work? If I have a [ or I have a ] and other such things, will they remain? And what about   extra spaces or even a tab?  
% set fo [open "/tmp/testo.txt" "w"] 
file5
% puts $fo $c
% close $fo
% exit
srv20 (178) $ cmp /tmp/testdata.txt /tmp/testo.txt
/tmp/testdata.txt /tmp/testo.txt differ: char 16, line 1

Most of the newlines in the original file are gone, as is the tab that was in /tmp/testdata.txt right before the word tab.

Another alternative to split is regexp

RWT

It even works well for large files, but there is one trick that might be necessary. Determine the size of the file first, then 'read' that many bytes. This allows the channel code to optimize buffer handling (preallocation in the correct size). I don't know anymore who posted this first. But you only need this for Tcl 8.0. This is something for the Tcl Performance page as well.
#  Slurp up the data file, optimized buffer handling
#  Only needed for Tcl 8.0
set fsize [file size "somefile"]
set fp [open "somefile" r]
set data [read $fp $fsize]
close $fp

dizzy

TclX has a read_file command:
set data  [read_file -nonewline $filename]
set bytes [read_file $filename $numbytes]

Also under Unix if you are not concerned about performance you can do:
set data [exec cat $filename]

AK

LV AK, what would be the advantage of using exec and cat to read in the file in this manner? Just curious.

For the simple task of reading a whole file and splitting it into a list, there is a critcl version over at loadf.

The comment above stating that you must use split to make sure you can deal with the contents is only true if you want to use list commands (such as foreach in the example). If you always treat the data as a string, you don't have to worry about unbalanced braces and such. If you are just searching for something on each line, one way to avoid split is the -line option of regexp, which forces matching line by line. Combine it with the -inline switch, which returns the matches as a list instead of placing them in variables, and iterate over that list, e.g.
foreach {fullmatch submatch1 submatch2} [regexp -line -inline $exp $str] {
    # process matches here - don't have to worry about skipping lines
    # because only matches make it here.
}

This assumes the exp contains an expression with 2 paren'd subsections.

LV With regards to the comment about exp containing an expression - do you mean like:
set exp {(a).*(b)}

and also, should that $str be $data (if set in the same universe as the read example above)?

BBH

glennj - Note that you must use the -all switch, otherwise the foreach loop will only execute once (on the first match):
foreach {line sub1 sub2} [regexp -all -line -inline $exp $str] {...}

LV same here, right - exp as I mention above, and str should be data? What if what I am looking for is a single regular expression, or even a constant string?
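One possible reading of the questions above, as a sketch rather than an answer from the original posters: the string passed to regexp is simply whatever variable holds the slurped file contents, and the pattern may or may not contain parenthesised groups. The sample data and both patterns below are assumptions made up for illustration.

```tcl
# Hypothetical data standing in for the contents read from a file
set data "name=alice\n# a comment\nport=8080\n\nhost=example\n"

# A pattern with two parenthesised groups (key and value)
set exp {^(\w+)=(.*)$}

# With -all -inline the result is a flat list:
# fullmatch key value fullmatch key value ...
set pairs {}
foreach {fullmatch key value} [regexp -all -line -inline $exp $data] {
    lappend pairs $key $value
}
puts $pairs   ;# prints: name alice port 8080 host example

# A pattern without groups yields one list element per match,
# so a single loop variable is enough
set comments [regexp -all -line -inline {^#.*$} $data]
foreach comment $comments {
    puts $comment
}
```

For a constant string rather than a pattern, the same loop works as long as any regexp metacharacters in the string are escaped.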

If you want to receive the input from the file line by line, without having to split it and worry about losing brackets, use fconfigure and then read the data line by line, e.g.
#  read the file one line at a time
set fp [open "somefile" r]
fconfigure $fp -buffering line
gets $fp data
while {$data != ""} {
     puts $data
     gets $fp data
}
close $fp

This is also very useful for command-response type network sockets (POP, SMTP, etc.) mailto:douglas@networkhackers.com

ZB 2009-11-18 I've got a feeling the above example has a serious flaw: it'll stop reading at the first "empty" line in the file. My proposal would be rather:
#  read the file one line at a time
set fp [open "somefile" r]
while { [gets $fp data] >= 0 } {
     puts $data
}
close $fp

I'm not sure whether "fconfigure $fp -buffering line" is really necessary.

Back to the original topic of reading in data: it just astonished CL to search the Wiki for what used to be the most common input idioms, and not find them at all. Before memory seemed so inexpensive, input was commonly done as
set fp [open $some_file]
while {-1 != [gets $fp line]} {
    puts "The current line is '$line'."
}
close $fp

Newcomers often try to write this with eof, and generally confuse themselves in the process [1]. It calls for
set fp [open $some_file]
while 1 {
    set line [gets $fp line]
    if [eof $fp] break
    puts "The current line is '$line'."
}

or equivalent.

smh Minor fix CL in your 2nd example - the line
set line [gets $fp line]

should read either
set linelength [gets $fp line]

or simply
gets $fp line

due to the syntax of gets which when passed a 2nd argument reads the line into the named variable and returns line length.
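Put together, the loop with smh's fix applied might look like the sketch below. It keeps the length returned by gets in its own variable and ends the loop on the -1 return value, which for an ordinary blocking file channel means there is nothing more to read. The file name and its three lines are hypothetical; the second line is deliberately empty to show that an empty line does not end the loop.

```tcl
# Create a small scratch file (the name is hypothetical)
set some_file "eof_demo.txt"
set fp [open $some_file w]
puts $fp "first"
puts $fp ""
puts $fp "third"
close $fp

set seen {}
set fp [open $some_file]
while 1 {
    # gets stores the text in 'line' and returns its length,
    # or -1 when no further line could be read
    set linelength [gets $fp line]
    if {$linelength < 0} break
    lappend seen $line
}
close $fp

puts [llength $seen]   ;# prints 3: the empty line is counted too
file delete $some_file
```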

Quick Parse huge file

fforeach (file foreach) is my implementation to speed up parsing a file line by line.
fforeach manages the open and close itself, so don't break out of it with return inside the body.
Feel free to change the encoding: fconfigure $fforeach_fid -encoding utf-8
Here utf-8 covers the characters of all the world's scripts.
# hkassem at gmail dot com - 2016
proc fforeach {fforeach_line_ref fforeach_file_path fforeach_body} {
    upvar $fforeach_line_ref fforeach_line
    set fforeach_fid [open $fforeach_file_path r]
    fconfigure $fforeach_fid -encoding utf-8
    while {[gets $fforeach_fid fforeach_line] >= 0} {
        # ------- FOREACH BODY ------------<
        uplevel $fforeach_body
        # ------END FOREACH BODY----------->
    }
    close $fforeach_fid
}

usage:
fforeach aLine "./mybigfile.txt" {
    # actions: do something   with the line
    puts $aLine  
}

See also ::fileutil::foreachLine, which does roughly the same thing.

Writing a file

I just noticed that there isn't an example of writing a file, so here goes.

Note: when you write to a file, the output is buffered, so the contents may not be visible until the file is closed or the Tcl script finishes executing. If you use "tail -f" midway to check on it, I am sorry to say you may find that the size of the file is 0.

LV Of course, if there is a need to see the file while it is open, just be certain your code invokes flush, which will force the output to the file instead of waiting until there is a large enough chunk of data to force out.
# create some data
set data "This is some test data.\n"
# pick a filename - if you don't include a path,
#  it will be saved in the current directory
set filename "test.txt"
# open the filename for writing
set fileId [open $filename "w"]
# send the data to the file -
#  omitting '-nonewline' will result in an extra newline
# at the end of the file
puts -nonewline $fileId $data
# close the file, ensuring the data is written out before you continue
#  with processing.
close $fileId

A simple Tk example using the text widget is at Text Widget Example.

so 4/21/01

But a file without newline at end of last line is not considered too well-behaved. Since in this example, a plain string is puts'ed, I'd advocate a newline here. It's different when you save a text widget's content, where a newline after the last line is always guaranteed; there the -nonewline switch is well in place. RS

LV RS, in the example above, $data has a newline in it, so it should be all good.

(About the extra newline thing... under Unix, (possibly POSIX) all text files are supposed to end in a blank line, hence the "extra" newline. This is the proper behavior, even if it isn't technically required for most things these days. Unix text editors and tools still enforce it, however. Should it be considered a bug if Tcl behaves this way under Win32? --CJU)

LV This comment puzzles me. I have never seen a requirement for text files to end in blank lines. I have seen a few broken programs which generated an error if the last line in a file didn't end in a newline - but that doesn't create a blank line. And I don't understand the last question - if Tcl behaves which way under Win32?

In situations where output is line buffered (default for text), puts -nonewline does not immediately deliver the output. One solution, if this is a problem, is to add flush $fileId.

An alternative is to "fconfigure $fileId -buffering none" to force automatic flushing whenever data is written. Andreas Kupries.

DKF: By default, when you get the data will depend on how much buffering is done; write a large amount and almost all of it will end up on disk immediately. The default buffering mode is to build up a block of data (usually a few kilobytes) and write that all at once, since that's much more efficient when writing lines frequently. But this is easy to tune with fconfigure (as Andreas mentions); line-buffered is great when you're writing data to places like shared logfiles or for simple interactive use, and unbuffered is best for complicated interactions. All Tcl's channels are block buffered by default except for stdout (and only if it is going to a terminal) which is line buffered, and stderr which is unbuffered (because that's useful for diagnosing a crash or other forced exit).
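The effect of the buffering modes can be observed with a small sketch that checks the file size after each step. The file name is hypothetical, and the sketch assumes the default block buffer is larger than the few bytes written, which matches the "few kilobytes" mentioned above.

```tcl
set logname "buffer_demo.log"   ;# hypothetical scratch file
set fp [open $logname w]

# Default block buffering: a small write stays in Tcl's buffer
puts $fp "first entry"
set before [file size $logname]

# flush pushes the buffered data out to the file
flush $fp
set after [file size $logname]

# With line buffering, each completed line is written out at once
fconfigure $fp -buffering line
puts $fp "second entry"
set linebuffered [file size $logname]

close $fp
puts "before flush: $before, after flush: $after, line buffered: $linebuffered"
file delete $logname
```

This is the situation in which "tail -f" shows nothing until the first flush, and then follows the file line by line once line buffering is switched on.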

Using Lisp pattern for reading and writing

I've found it quite useful to use this pattern to work with files:
proc with-open-file {fname mode fp block} {
    upvar 1 $fp fpvar
    set binarymode 0
    if {[string equal [string index $mode end] b]} {
        set mode [string range $mode 0 end-1]
        set binarymode 1
    }
    set fpvar [open $fname $mode]
    if {$binarymode} {
        fconfigure $fpvar -translation binary
    }
    uplevel 1 $block
    close $fpvar
}

Usage is like this:
with-open-file $fname w fp {
    puts -nonewline $fp $data
    puts $fp "more data"
}
with-open-file $fname r fp {
    set line [gets $fp]
}

This scheme hides implementation details and therefore allows modifying of file-handling at run-time (by adjusting with-open-file).

More about this at Emulating closures in Tcl page. --mfi.
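One thing the pattern above does not cover is a block that raises an error: the channel would then stay open. A possible variant, sketched here under the assumption that Tcl 8.6 or later is available (for the try command), closes the channel in a finally clause. The name with-open-file-safe and the scratch file name are hypothetical, and the binary-mode handling of the original is left out for brevity.

```tcl
# Variant of with-open-file that always closes the channel.
# The proc name is hypothetical; requires Tcl 8.6 for 'try'.
proc with-open-file-safe {fname mode fp block} {
    upvar 1 $fp fpvar
    set fpvar [open $fname $mode]
    try {
        uplevel 1 $block
    } finally {
        close $fpvar
    }
}

set fname "with_open_demo.txt"   ;# hypothetical scratch file
with-open-file-safe $fname w out {
    puts $out "hello"
}
with-open-file-safe $fname r in {
    set line [gets $in]
}
puts $line   ;# prints hello

# An error inside the block still closes the channel
set failed [catch {
    with-open-file-safe $fname r in {
        error "boom"
    }
}]
set still_open [expr {$in in [chan names]}]
puts "$failed $still_open"   ;# prints 1 0
file delete $fname
```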

Reading and writing binary data and unusual encodings

Anyone have the know-how to add to this page to cover reading and writing binary data and data from various languages (ie interacting with the encoding system)?

RS: If the encoding is known, just add the command
fconfigure $fp -encoding $enc

between the [open] and the first read access. From then on, it's transparent like reading pure ASCII. See also Unicode file reader, A little Unicode editor.
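As a small illustration of RS's advice, the sketch below writes a string with non-ASCII characters as utf-8 and reads it back. The file name and the sample text are hypothetical; the same fconfigure call is used on the writing side so that the bytes on disk are known.

```tcl
set fname "encoding_demo.txt"              ;# hypothetical scratch file
set text "Gr\u00fc\u00dfe \u4e16\u754c"    ;# 8 characters, some non-ASCII

# Write: configure the encoding between open and the first puts
set fp [open $fname w]
fconfigure $fp -encoding utf-8
puts -nonewline $fp $text
close $fp

# Read: configure the encoding between open and the first read
set fp [open $fname r]
fconfigure $fp -encoding utf-8
set back [read $fp]
close $fp

set nbytes [file size $fname]
puts [expr {$back eq $text}]   ;# prints 1
puts [string length $back]     ;# prints 8 (characters)
puts $nbytes                   ;# prints 14 (bytes in utf-8)
file delete $fname
```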

AK: Regarding the other part of the question see Working with binary data.

Often new programmers stop by comp.lang.tcl and ask "how can I replace the information in just one line of a file". Tcl uses something called standard I/O for input and output to text and standard binary files (as do many other applications), and it does not provide a simple method for doing this. Using Tcl's low-level file input/output, one can do a very crude version of this with open, seek, puts, and close. This allows you to replace a string of N bytes with another string of the same number of bytes.
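The crude same-length replacement described above might be sketched as follows. The file name, its contents and the byte offset are all made up for the example; binary translation is used on both sides so that the offsets are exact byte positions on every platform.

```tcl
set fname "inplace_demo.txt"   ;# hypothetical scratch file

# Create a file with known contents
set fp [open $fname w]
fconfigure $fp -translation binary
puts -nonewline $fp "alpha\nbravo\ncharlie\n"
close $fp

# r+ opens an existing file for reading and writing without truncating it
set fp [open $fname r+]
fconfigure $fp -translation binary
seek $fp 6 start              ;# "alpha\n" is 6 bytes, so "bravo" starts here
puts -nonewline $fp "BRAVO"   ;# exactly 5 bytes, same as the text replaced
close $fp

# Read the file back to check the result
set fp [open $fname r]
fconfigure $fp -translation binary
set result [read $fp]
close $fp
puts -nonewline $result
file delete $fname
```

If the replacement were longer or shorter than the original text, the rest of the file would be overwritten or left with stale bytes, which is why rewriting the whole file is the usual approach.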

If the OS being used has other means of accessing and updating files, the developer will likely need to find an OS specific solution for this. There are a number of extensions for interacting with relational and other types of databases (such as Oracle, Sybase, etc.).

I always worried about losing data by having to filter out or replace all command characters before using one of Tcl's list commands (e.g. lindex, lrange, lsearch), but I also never wanted to lose the convenience they offer. So, for an IRC bot I wrote in tclsh, I wrote a proc to use in place of the lindex command. It does not give exactly the same output: it may also return the " " (space) that is part of the requested element. If you add ! to the number argument, it also returns the space after the requested element. I'm not sure whether this also works for other types of data; I only tested it with SMTP, IRC, and XML/RSS, and it works fine. This is my proc:
proc xindex { data num } {
set index [set 0 [set 1 [set 2 [set 3 0]]]]

if {[string index $num 0] == "!"} { set 1 ! ; set num [string range $num 1 end] }

while {$0 <= [string length $data] && $index != $num} {

set 3 [string index $data $0]

if {[string index $data [expr [expr $0] +1]] == " " && $3 != " "} { incr index } ; set 2 [incr 0] }

if {$num != 0} { set data [string range $data 1 end] }

while {[string index $data $2] == " "} { incr 2 }

while {[string index $data $2] != " " && $2 <= [string length $data] } { incr 2 }

if {[string index $data $2] == " " && $1 != "!"} { set 2 [expr [expr $2] - 1] }

return [string range $data $0 $2] }

I hope someone may benefit from it. It still works fine for me, but I have never used it on types of data other than those I specified.

- Obzolete

overloading file handling functions

[cammy] 2012-07-19 - i tried source read_file.tcl but "command not found" is printed

RKzn 2016-07-19 - source does read the file, BUT it also evaluates it as a Tcl script. It is not meant for general reading operations. The "command not found" error is most likely due to the file not containing a script.

[potrzebie] - 2012-10-07 05:11:25

Use read's -nonewline switch if you're going to split the result with \n. Otherwise, if the file ends with a newline, you'll get an extra, empty, "false" element at the end of the lines list.
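The difference is easy to see with a two-line file. The sketch below uses a hypothetical scratch file and reads it twice, once with and once without -nonewline.

```tcl
set fname "nonewline_demo.txt"   ;# hypothetical scratch file
set fp [open $fname w]
puts $fp "one"
puts $fp "two"
close $fp

# Plain read keeps the final newline, so split yields a trailing empty element
set fp [open $fname r]
set with [split [read $fp] "\n"]
close $fp

# read -nonewline drops the final newline before splitting
set fp [open $fname r]
set without [split [read -nonewline $fp] "\n"]
close $fp

puts [llength $with]      ;# prints 3: one, two and an empty element
puts [llength $without]   ;# prints 2: one and two
file delete $fname
```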

[rlugg] - 2016-02-18 23:32:34

I like the "Using Lisp pattern for reading and writing" example. I believe there is a minor error. There should be an added:
set binarymode 0

right after the upvar line. I wasn't certain, so didn't dare to edit it myself.

AM You are right - I made this correction (and a correction in the formatting).