简体   繁体   中英

Use grep on file in Python

I have searched the grep answers on here and cannot find an answer. They all seem to search for a string in a file, not a list of strings from a file. I already have a search function that works, but grep does it WAY faster. I have a list of strings in a file sn.txt (with one string on each line, no deliminators). I want to search another file (Merge_EXP.exp) for lines that have a match and write it out to a new file. The file I am searching in has a half millions lines, so searching for a few thousand in there takes hours without grep.

When I run it from command prompt in windows, it does it in minutes:

grep --file=sn.txt Merge_EXP.exp > Merge_EXP_Out.exp

How can I call this same process from Python? I don't really want alternatives in Python because I already have one that works but takes a while. Unless you think you can significantly improve the performance of that:

def match_SN(serialnumb, Exp_Merge, output_exp):
    fout = open(output_exp,'a')
    f = open(Exp_Merge,'r')
    # skip first line
    f.readline()
    for record in f:
        record = record.strip().rstrip('\n')
        if serialnumb in record:
            fout.write (record + '\n')
    f.close()
    fout.close()

def main(Output_CSV, Exp_Merge, updated_exp):

    # create a blank output
    fout = open(updated_exp,'w')

    # copy header records
    f = open(Exp_Merge,'r')
    header1 = f.readline()
    fout.write(header1)
    header2 = f.readline()
    fout.write(header2)
    fout.close()
    f.close()

    f_csv = open(Output_CSV,'r')
    f_csv.readline()
    for rec in f_csv:
        rec_list = rec.split(",")
        sn = rec_list[2]
        sn = sn.strip().rstrip('\n')
        match_SN(sn,Exp_Merge,updated_exp)

Here is a optimized version of pure python code:

def main(Output_CSV, Exp_Merge, updated_exp):
    output_list = []

    # copy header records
    records = open(Exp_Merge,'r').readlines()
    output_list = records[0:2]

    serials = open(Output_CSV,'r').readlines()
    serials = [x.split(",")[2].strip().rstrip('\n') for x in serials]

    for s in serials:
        items = [x for x in records if s in x]
        output_list.extend(items)

    open(updated_exp, "w").write("".join(output_list))

main("sn.txt", "merge_exp.exp", "outx.txt")

Input

sn.txt:

x,y,0011
x,y,0002

merge_exp.exp:

Header1
Header2
0011abc
0011bcd
5000n
5600m
6530j
0034k
2000lg
0002gg

Output

Header1
Header2
0011abc
0011bcd
0002gg

Try this out and see how much time it takes...

When I use full path to grep location it worked (I pass it the grep_loc, Serial_List, Export):

import os

Export_Dir = os.path.dirname(Export)
Export_Name = os.path.basename(Export)

Output = Export_Dir + "\Output_" + Export_Name
print "\nOutput: " + Output + "\n"

cmd = grep_loc + " --file=" + Serial_List + " " + Export + " > " + Output
print "grep usage: \n" + cmd + "\n"
os.system(cmd)
print "Output created\n"

I think you have not chosen the right title for your question: What you want to do is the equivalent of a database JOIN. You can use grep for that in this particular instance, because one of your files only has keys and no other information. However, I think it is likely (but of course I don't know your case) that in the future your sn.txt may also contain extra information.

So I would solve the generic case. There are multiple solutions:

  • import all data into a database, then do a LEFT JOIN (in sql) or equivalent
  • use a python large data tool

For the latter, you could try numpy or, recommended because you are working with strings, pandas . Pandas has an optimized merge routine, which is very fast in my experience (uses cython under the hood).

Here is pandas PSEUDO code to solve your problem. It is close to real code but I need to know the names of the columns that you want to match on. I assumed here the one column in sn.txt is called key , and the matching column in merge_txt is called sn . I also see you have two header lines in merge_exp, read the docs for that.

# PSEUDO CODE (but close)
import pandas
left = pandas.read_csv('sn.txt')
right = pandas.read_csv('merge_exp.exp')
out = pandas.merge(left, right, left_on="key", right_on="sn", how='left')
out.to_csv("outx.txt")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM