
Processing large files using Tcl

I have some information in two large files.
One of them (file1.txt, with ~4 million lines) contains all object names (which are unique) and their types.
The other (file2.txt, with ~2 million lines) contains some object names (they can be duplicated) and values assigned to them.
So, I have something like below in file1.txt :

objName1 objType1
objName2 objType2
objName3 objType3
...

And in file2.txt I have:

objName3 val3_1
objName3 val3_2
objName4 val4
...

For all objects in file2.txt I need to output the object name, its type, and the value assigned to it, in a single file like below:

objType3 val3_1 "objName3"
objType3 val3_2 "objName3"
objType4 val4 "objName4"
...

Previously, object names in file2.txt were supposed to be unique, so I implemented a solution where I read all the data from both files, saved it to Tcl arrays, and then iterated over the larger array, checking whether an object with the same name exists in the smaller array; if so, I wrote the needed information to a separate file. But this runs too long (> 10 hours and it hasn't completed yet).
How can I improve my solution, or is there another way to do this?

EDIT:
Actually I don't have file1.txt; I obtain that data from some procedure and write it into a Tcl array. I run a procedure to get the object types and save them to a Tcl array; then I read file2.txt and save its data to another Tcl array; then I iterate over the items of the first array, and if an object name matches some object in the second (object values) array, I write the info to the output file and erase that element from the second array. Here is a piece of the code that I'm running:

set outFileName "output.txt"
if {[catch {open $outFileName "w"} fid]} {
   puts "ERROR: Failed to open file '$outFileName', no write permission"
   exit 1
}


# get object types
set TIME_start [clock clicks -milliseconds]
array set objTypeMap [list]
# here is some proc that fills up objTypeMap
set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Object types are found. Elapsed time $TIME_taken"

# read file2.txt
set TIME_start [clock clicks -milliseconds]
set file2 [lindex $argv 5]
if {[catch {set fp [open $file2 r]} errMsg]} {
    puts "ERROR: Failed to open file '$file2' for reading"
    exit 1
}

set objValData [read $fp]
close $fp
# tcl list containing lines of file2.txt
set objValData [split $objValData "\n"]
# remove last empty line
set objValData [lreplace $objValData end end]
array set objValMap [list]
foreach item $objValData {
    set objName [string range $item 0 [expr {[string first " " $item] - 1}]]
    set objValue [string range $item [expr {[string first " " $item] + 1}] end]
    set objValMap($objName) $objValue
}
# clear objValData
unset objValData

set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Object value data is read and processed. Elapsed time $TIME_taken"

# write to file
set TIME_start [clock clicks -milliseconds]
foreach { objName objType } [array get objTypeMap] {
    if { [array size objValMap] == 0 } {
        break
    }
    if { [info exists objValMap($objName)] } {
        set objValue $objValMap($objName)
        puts $fid "$objType $objValue \"$objName\""
        unset objValMap($objName)
    }
}

if { [array size objValMap] != 0 } {
    foreach { objName objVal } [array get objValMap] {
        puts "WARNING: Cannot find the type of object $objName, skipped..."
    }
}
close $fid

set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Output is created. Elapsed time $TIME_taken"

It seems that the last step (writing to the file) has ~8 * 10^12 iterations to do, and it's not realistic to complete that in a reasonable time: I tried doing 8 * 10^12 iterations in a for loop that just prints the iteration index, and ~850*10^6 iterations took ~30 minutes (at that rate the whole loop would take roughly 4700 hours).
So there should be another solution.

EDIT: It seems the cause was poor hashing behavior for the file2.txt map: when I tried shuffling the lines of file2.txt, I got the results in about 3 minutes.
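
For reference, a minimal sketch of such a shuffling workaround (the file names are assumptions): read the lines, permute them with a Fisher-Yates pass, and write them back out.

# Sketch only: shuffle the lines of file2.txt to sidestep the degenerate
# hashing behavior described above (file names are assumptions).
set fp [open file2.txt r]
set lines [split [string trim [read $fp]] \n]
close $fp
for {set i [expr {[llength $lines] - 1}]} {$i > 0} {incr i -1} {
    set j [expr {int(rand() * ($i + 1))}]
    # swap elements i and j
    set tmp [lindex $lines $i]
    lset lines $i [lindex $lines $j]
    lset lines $j $tmp
}
set fp [open file2_shuffled.txt w]
puts $fp [join $lines \n]
close $fp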

Write the data to file1 and then let external tools do all the hard work (for this task they are far more optimized than natively written Tcl code):

exec bash -c {join -o 0,1.2,2.2 <(sort file1.txt) <(sort file2.txt)} > result.txt
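
Note that join prints the join field first, so each result line reads objName objType value. If the output must match the OP's type value "objName" layout exactly, one possible post-processing step (a sketch, assuming GNU join, sort, and awk are available) is:

exec bash -c {
    join <(sort file1.txt) <(sort file2.txt) \
        | awk '{printf "%s %s \"%s\"\n", $2, $3, $1}'
} > result.txt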

A pure-Tcl variant of Glenn Jackman's code would be

package require fileutil
package require struct::list

set data1 [lsort -index 0 [split [string trim [fileutil::cat file1.txt]] \n]]
set data2 [lsort -index 0 [split [string trim [fileutil::cat file2.txt]] \n]]
fileutil::writeFile result.txt [struct::list dbJoin -full 0 $data1 0 $data2]

But in this case each row will have four columns, not three: the two columns from file1.txt followed by the two columns from file2.txt. If that is a problem, reducing the number of columns to three is trivial (see the sketch after the next paragraph).

The file join in the example is also a full join, i.e. all rows from both files occur in the result, padded with empty strings where the other file has no corresponding data. To solve the OP's problem, an inner join is probably better (only rows that correspond are retained).
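
For instance, a hedged sketch of both adjustments, reusing $data1 and $data2 from above: an inner join (the dbJoin mode swapped from -full to -inner), followed by reshaping each four-column row into the OP's three-column format.

# Sketch only: assumes each joined row is {name type name value},
# per the four-column layout described above.
set joined [struct::list dbJoin -inner 0 $data1 0 $data2]
set out {}
foreach row $joined {
    lassign $row name type _ value
    lappend out [format {%s %s "%s"} $type $value $name]
}
fileutil::writeFile result.txt "[join $out \n]\n"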

fileutil::cat reads the contents of a file; string trim removes leading and trailing whitespace from the contents, to avoid empty lines at the beginning or end; split ... \n creates a list where every row becomes an item; lsort -index 0 sorts that list based on the first word of every item.

The code is verified to work with Tcl 8.6 and fileutil 1.14.8. The fileutil package is a part of the Tcllib companion library for Tcl: the package can be individually upgraded to the current version by downloading the source and copying it to the relevant location in the Tcl installation's lib tree (C:\Tcl\lib\teapot\package\tcl\teapot\tcl8\8.2 in my case).

Quick-and-dirty install: download fileutil.tcl from here (use the Download button) and copy the file to where your other sources are. In your source code, call source fileutil.tcl and then package require fileutil. (There may still be compatibility problems with Tcl or with e.g. the cmdline package. Reading the source may suggest workarounds for such.) Remember to check the license conditions for conflicts.

Documentation: fileutil package, lsort , package , set , split , string , struct::list package

So… file1.txt describes a mapping and file2.txt is the list of things to process and annotate? The right thing is to load the mapping into an array or dictionary keyed by the part that you will look things up by, and then to go through the other file line by line. That keeps the amount of data in memory down, but it's worth holding the whole mapping that way anyway.

# We're doing many iterations, so worth doing proper bytecode compilation 
apply {{filename1 filename2 filenameOut} {
    # Load the mapping; uses memory proportional to the file size
    set f [open $filename1]
    while {[gets $f line] >= 0} {
        regexp {^(\S+)\s+(.*)} $line -> name type
        set types($name) $type
    }
    close $f

    # Now do the streaming transform; uses a small fixed amount of memory
    set fin [open $filename2]
    set fout [open $filenameOut "w"]
    while {[gets $fin line] >= 0} {
        # Assume that the mapping is probably total; if a line has no mapping entry,
        # we print it as it was before. You might have a different preferred strategy here.
        catch {
            regexp {^(\S+)\s+(.*)} $line -> name info
            set line [format "%s %s \"%s\"" $types($name) $info $name]
        }
        puts $fout $line
    }
    close $fin
    close $fout

    # All memory will be collected at this point
}} "file1.txt" "file2.txt" "fileProcessed.txt"

Now, if the mapping is very large, so much that it doesn't fit in memory, then you might be better doing it via building file indices and stuff like that, but frankly then you're actually better off getting familiar with SQLite or some other database.
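
To make that last suggestion concrete, here is a minimal sketch using the sqlite3 Tcl package (the database, file, and table names are assumptions): load both data sets into tables, then let SQLite perform the join on disk.

# Sketch only: database and table names are made up for illustration.
package require sqlite3
sqlite3 db mapping.db
db eval {
    CREATE TABLE IF NOT EXISTS types (name TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE IF NOT EXISTS vals  (name TEXT, value TEXT);
}
# Bulk-load both files inside one transaction for speed
db transaction {
    set f [open file1.txt]
    while {[gets $f line] >= 0} {
        regexp {^(\S+)\s+(.*)} $line -> name type
        db eval {INSERT OR REPLACE INTO types VALUES ($name, $type)}
    }
    close $f
    set f [open file2.txt]
    while {[gets $f line] >= 0} {
        regexp {^(\S+)\s+(.*)} $line -> name value
        db eval {INSERT INTO vals VALUES ($name, $value)}
    }
    close $f
}
# The index behind the PRIMARY KEY makes each lookup in the join cheap
set fout [open fileProcessed.txt w]
db eval {
    SELECT t.type AS type, v.value AS value, v.name AS name
    FROM vals AS v JOIN types AS t ON t.name = v.name
} {
    puts $fout [format {%s %s "%s"} $type $value $name]
}
close $fout
db close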
