
Suggestions on processing large file - python or command line?

Given two files, one containing entries of the form:

label1 label2 name1
label1 label3 name2

and the other of the form:

label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

Suppose you want to extract those lines from file two for which the first three elements appear in a line (order important) in file one - any suggestions on how this might be done quickly?

The output file from any such script given the above sample data would be:

label1 label2 name1 0.1 1000

I toyed with python:

inp = open('file1.txt', 'r')
look_up = [i.split() for i in inp.readlines()]
inp.close()

inp = open('file2.txt', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if [line[0], line[1], line[2]] in look_up:
        holder.append(line)
    line = inp.readline()

However this seems to take a while. These files are rather large.

Thanks!

Your python version is rather inefficient because you're testing for membership in a list, rather than a set or a dict (ie O(n) lookup time instead of O(1)).

Try using a set of tuples or a set of strings instead. Tuples are the safer choice, as the two files could be split on different delimiters, but I don't think you'll see a particularly large performance difference between them: the cost of tuple('something'.split()) is trivial compared to a membership test on a very long list.

Also, there's no need to call inp.readlines(). Instead, you could just do

look_up = set(tuple(line.split()) for line in inp)

And you should see a significant speedup without having to change any other part of your code, aside from using tuple(line[:3]) rather than [line[0], line[1], line[2]].
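Put together, the reworked loop might look like this (a Python 3 sketch; the question's sample data is written inline so the snippet runs standalone):

```python
# sample data from the question, written out so the example is self-contained
with open('file1.txt', 'w') as f:
    f.write('label1 label2 name1\nlabel1 label3 name2\n')
with open('file2.txt', 'w') as f:
    f.write('label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n')

# build the lookup as a set of tuples: O(1) membership instead of O(n)
with open('file1.txt') as inp:
    look_up = set(tuple(line.split()) for line in inp)

holder = []
with open('file2.txt') as inp:
    for line in inp:
        fields = line.split()
        # test only the first three fields against the key set
        if tuple(fields[:3]) in look_up:
            holder.append(fields)
```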

Actually, grep and bash are pretty perfect for this... (Untested, but it should work.)

while read line
do
    grep "$line" "file2.txt"
done < "file1.txt"
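Incidentally, spawning one grep per key line is most of the cost of that loop; grep can read all the patterns at once with -f, and -F treats them as fixed strings rather than regexes. (Caveat: this matches a key anywhere in the line, not just in the first three fields, so it's a sketch rather than an exact equivalent.)

```shell
# sample data from the question, created in a scratch directory
cd "$(mktemp -d)"
printf 'label1 label2 name1\nlabel1 label3 name2\n' > file1.txt
printf 'label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n' > file2.txt

# one grep invocation, all keys read as fixed-string patterns
grep -F -f file1.txt file2.txt
```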

To see which one is faster, we can generate some test data (~4500 keys in file1.txt and 1000000 lines in file2.txt ), and benchmark a simple python version of same thing (Roughly... The lines will be printed in a different order than the grep version.).
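The generator itself isn't shown in the post; a sketch along these lines could produce files of roughly those sizes (the label vocabulary, overlap rate, and value columns are assumptions, not the original test data):

```python
import random

def generate(key_count=4500, data_count=1_000_000,
             key_file='file1.txt', data_file='file2.txt'):
    """Write key_count three-token key lines and data_count data lines,
    drawing data keys from a larger pool so only some lines match."""
    pool = [f'label{random.randrange(key_count * 2)} '
            f'label{random.randrange(100)} '
            f'name{random.randrange(key_count * 2)}'
            for _ in range(key_count * 2)]
    keys = random.sample(pool, key_count)
    with open(key_file, 'w') as f:
        f.write('\n'.join(keys) + '\n')
    with open(data_file, 'w') as f:
        for _ in range(data_count):
            # two arbitrary numeric columns, as in the question's format
            f.write(f'{random.choice(pool)} '
                    f'{random.random():.3f} {random.random() * 1000:.1f}\n')
```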

with open('file1.txt', 'r') as keyfile:
    lookup = set(tuple(line.split()) for line in keyfile)

with open('file2.txt', 'r') as datafile:
    for line in datafile:
        if tuple(line.split()[:3]) in lookup:
            print line,

The python version turns out to be ~70x faster:

jofer@cornbread:~/so> time sh so_temp149.sh > a

real    1m47.617s
user    0m51.199s
sys     0m54.391s

vs.

jofer@cornbread:~/so> time python so_temp149.py > b

real    0m1.631s
user    0m1.558s
sys     0m0.071s

Of course, the two examples are approaching the problem in entirely different ways. We're really comparing two algorithms, not two implementations. For example, if we only have a couple of key lines in file1 , the bash/grep solution easily wins.

(Does bash have a builtin container of some sort with O(1) lookup for membership? (I think bash 4 might have a hash table, but I don't know anything about it...) It would be interesting to try implementing a similar algorithm to the python example above in bash, as well...)
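For what it's worth, bash 4 does have associative arrays (declare -A), which give exactly that O(1) membership test. A sketch of the python algorithm above in pure bash (untested at scale; it assumes whitespace-delimited fields and includes the question's sample data so it runs standalone):

```shell
# sample data from the question, created in a scratch directory
cd "$(mktemp -d)"
printf 'label1 label2 name1\nlabel1 label3 name2\n' > file1.txt
printf 'label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n' > file2.txt

declare -A lookup               # bash 4+ associative array (hash table)

# load each key line from file1.txt as a hash key
while read -r a b c; do
    lookup["$a $b $c"]=1
done < file1.txt

# scan file2.txt; print lines whose first three fields form a known key
while read -r a b c rest; do
    if [[ -n "${lookup["$a $b $c"]}" ]]; then
        printf '%s %s %s %s\n' "$a" "$b" "$c" "$rest"
    fi
done < file2.txt
```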

Hacky bash/sort/Perl solution:

$ cat > 1
label1 label2 name1
label1 label3 name2

$ cat > 2
label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

$ (cat 1; cat 2; ) | sort | perl -ne 'INIT{$pattern_re="(?:\\S+) (?:\\S+) (?:\\S+)"; $current_pattern="";} if(/^($pattern_re)$/o){$current_pattern=$1} else {if(/^($pattern_re)/o) { print if $1 eq $current_pattern} }'
label1 label2 name1 0.1 1000

It merges both files into one stream and sorts it (so lines with the same key end up in adjacent chunks, each led by the line from file 1), then uses a Perl one-liner to keep only the data lines whose preceding "header" line came from file 1.

You could try using the string "label1 label2 name1" as the key, rather than a tuple of the three values.

I'd use a hash to store the values from the first file. It's not that error-resilient (it assumes one and only one space between each item), but you'll get the general idea...

#!/usr/bin/env python

labels={}
with open('log') as fd:
    for line in fd:
        line=line.strip()
        labels[line]=True

with open('log2') as fd:
    for line in fd:
        if " ".join(line.split()[0:3]) in labels:
            print line
