
python beginner - faster way to find and replace in large file?

I have a file of about 100 million lines in which I want to replace text with alternate text stored in a tab-delimited file. The code that I have works, but is taking about an hour to process the first 70K lines. In trying to incrementally advance my Python skills, I am wondering whether there is a faster way to do this. Thanks! The input file looks something like this:

CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518
CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518

and the file with replacement values looks like this:

WBGene00045518 21ur-5153

Here is my code:

infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime
startTime = datetime.now()

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]} 
    udict.update(udict1)

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)   
linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0: 
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print linecounter   
        print (datetime.now()-startTime)
infile1.close()
infile2.close()
outfile.close()

You should split your lines into "words" and only look up these words in your dictionary:

>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']

This will eliminate the loop over the dictionary you do for every single line.

Here's the complete code:

import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        outfile.write(line)

Edit : An alternative approach is to build a single mega-regex from your dictionary:

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)
regex = re.compile("|".join(map(re.escape, udict)))
with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        outfile.write(regex.sub(lambda m: udict[m.group()], line))

I was thinking about your loop over the dictionary keys and a way to optimize it, and meant to make other comments on your code later.

But then I stumbled on this part:

if linecounter in mult10K:
    print linecounter   
    print (datetime.now()-startTime)

This innocent-looking snippet actually has Python sequentially scan and compare as many as 100 items in your mult10K list for every single line in your file.

Replace this part with:

if linecounter % 10000 == 0:
    print linecounter   
    print (datetime.now()-startTime)

(And forget about the mult10K part entirely) - you should get a significant speed-up.

Also, it seems like you are recording multiple output lines for each input line - your main loop is like this:

linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0: 
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1

Replace it for this:

for linecounter, line in enumerate(infile2):
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0: 
            print key, value
            line = line.replace(key, value)
    outfile.write(line + '\n')

This properly writes only one output line for each input line (besides eliminating the code duplication, and taking care of the line counting in a "pythonic" way).

This code is full of linear searches. It's no wonder it's running slowly. Without knowing more about the input, I can't give you advice on how to fix these problems, but I can at least point out the problems. I'll note major issues, and a couple of minor ones.

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]} 
    udict.update(udict1)

Don't use update here; just add the item to the dictionary:

    udict[linelist[0]] = linelist[1]

This will be faster than creating a dictionary for every entry. (And in fact, Sven Marnach's generator-based approach to creating this dictionary is better still.) This is fairly minor, though.

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

This is totally unnecessary. Remove this; I'll show you one way to print at intervals without this.

linecounter = 0
for line in infile2:
    for key, value in udict.items():

This is your first big problem. You're doing a linear search through the dictionary for keys in the line, for each line. If the dictionary is very large, this will require a huge number of operations: 100,000,000 * len(udict).

        matches = line.count(key)

This is another problem. You're looking for matches using a linear search. Then you call replace, which does the same linear search all over again! You don't need to check for a match first; replace simply returns the string unchanged if there isn't one. This won't make a huge difference either, but it will gain you something.
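A quick illustration, using the gene ID from the question:

>>> 'ID=Gene:WBGene00045518'.replace('xyz', '21ur-5153')
'ID=Gene:WBGene00045518'
>>> 'ID=Gene:WBGene00045518'.replace('WBGene00045518', '21ur-5153')
'ID=Gene:21ur-5153'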

        line = line.replace(key, value)

Keep doing these replaces, and then only write the line once all replacements are done:

    outfile.write(line + '\n')

And finally,

    linecounter += 1
    if linecounter in mult10K:

Forgive me, but this is a ridiculous way to do this! You're doing a linear search through mult10K to determine when to print a line. Here again, this adds a total of almost 100,000,000 * 100 operations. You should at least search in a set; but the best approach (if you really must do this) would be to do a modulo operation and test that.

    if not linecounter % 10000: 
        print linecounter   
        print (datetime.now()-startTime)

To make this code efficient, you need to get rid of these linear searches. Sven Marnach's answer suggests one way that might work, but I think it depends on the data in your file, since the replacement keys might not correspond to obvious word boundaries. (The regex approach he added addresses that, though.)

This is not Python specific, but you might unroll your double for loop a bit so that the file writes do not occur on every iteration of the loop. Perhaps write to the file every 1,000 or 10,000 lines; see the sketch below.
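A minimal sketch of that idea, assuming an illustrative batch size of 10,000 lines (the names pending and FLUSH_EVERY are made up for this example, and the replacement step is elided):

FLUSH_EVERY = 10000  # illustrative batch size
pending = []
with open('f2.txt', 'r') as infile2, open('out.txt', 'w') as outfile:
    for line in infile2:
        # ... perform the replacements on line here ...
        pending.append(line)
        if len(pending) >= FLUSH_EVERY:
            outfile.writelines(pending)  # one write call per batch
            pending = []
    if pending:
        outfile.writelines(pending)  # flush whatever is left over

That said, Python's file objects already buffer writes internally, so it's worth timing whether batching like this actually helps.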

I'm hoping that writing one output line for each input line times the number of replacement strings is a bug, and that you really only intended to write one output line for each input line.

You need to find a way to test the lines of input for matches as quickly as possible. Looping through the entire dictionary is probably your bottleneck.

I believe regular expressions are precompiled into state machines that can be highly efficient. I have no idea how the performance suffers when you generate a huge expression, but it's worth a try.

freakin_huge_re = re.compile('(' + ')|('.join(map(re.escape, udict.keys())) + ')')
for line in infile2:
    # findall returns a tuple of groups per match; only the group
    # that matched is non-empty, so joining the tuple recovers the key
    matches = [''.join(tup) for tup in freakin_huge_re.findall(line)]
    if matches:
        for key in matches:
            line = line.replace(key, udict[key])

The obvious one in Python is the list comprehension - it's a faster (and more readable) way of turning this:

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

into this:

mult10K = [x*10000 for x in range(100)]

Likewise, where you have:

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]} 
    udict.update(udict1)

We can use a dict comprehension (with a generator expression):

lines = (line.strip().split('\t') for line in infile1)
udict = {line[0]: line[1] for line in lines}

It's also worth noting here that you appear to be working with a tab-delimited file, in which case the csv module might be a much better option than split().

Also note that using the with statement will increase readability and make sure your files get closed (even on exceptions).

Print statements will also slow things down quite a lot if they are performed on every iteration - they are useful for debugging, but when running on your main chunk of data, it's probably worth removing them.

Another 'more pythonic' thing you can do is use enumerate() instead of adding one to a variable each time. E.g.:

linecounter = 0
for line in infile2:
   ...
   linecounter += 1

Can be replaced with:

for linecounter, line in enumerate(infile2):
    ...

Where you are counting occurrences of a key, the better solution is to use in:

if key in line:

As this short-circuits after finding an instance.
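For example, with the identifiers from the question:

>>> 'WBGene00045518' in 'ID=Gene:WBGene00045518'
True
>>> 'WBGene00045518' in 'CHROMOSOME_IV ncRNA gene'
False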

Adding all this up, let's see what we have:

import csv
from datetime import datetime
startTime = datetime.now()

with open('f1.txt', 'r') as infile1:
    reader = csv.reader(infile1, delimiter='\t')
    udict = dict(reader)

with open('f2.txt', 'r') as infile2, open('out.txt', 'w') as outfile:
    for line in infile2:
        for key, value in udict.items():
            if key in line: 
                line = line.replace(key, value)
        outfile.write(line)  # lines read from the file keep their trailing newline

Edit: List comp vs normal loop, as requested in the comments:

python -m timeit "[i*10000 for i in range(10000)]"
1000 loops, best of 3: 909 usec per loop

python -m timeit "a = []" "for i in range(10000):" "  a.append(i)"
1000 loops, best of 3: 1.01 msec per loop

Note usec vs msec. It's not massive, but it's something.
