How to remove lines from a text file based on the values in a list?

I have a very large text file (coverage.txt), >2 GB, and it looks like this:

#RefName    Pos Coverage
BGC0000001_59320bp  0   0
BGC0000001_59320bp  1   0
BGC0000002_59320bp  2   0
BGC0000002_59320bp  3   0
BGC0000002_59320bp  4   0
BGC0000003_59320bp  5   0
BGC0000003_59320bp  6   0
BGC0000003_59320bp  7   0
BGC0000004_59320bp  8   0
BGC0000004_59320bp  7   0
BGC0000004_59320bp  8   0
BGC0000005_59320bp  7   0
BGC0000005_59320bp  8   0
BGC0000005_59320bp  7   0
BGC0000006_59320bp  8   0
BGC0000006_59320bp  7   0
BGC0000006_59320bp  8   0
BGC0000007_59320bp  7   0
BGC0000007_59320bp  8   0
BGC0000007_59320bp  7   0
BGC0000008_59320bp  8   0
BGC0000008_59320bp  7   0
BGC0000008_59320bp  8   0
BGC0000009_59320bp  7   0
BGC0000009_59320bp  8   0

I have another text file (rmList.txt) like this:

BGC0000002
BGC0000004
BGC0000006
BGC0000008

I want to remove the lines from coverage.txt that contain any of the IDs in rmList.txt.

Here's what I tried:

wanted = [line.strip() for line in open('rmList.txt')]  # IDs to remove
files = 'coverage.txt'

def rmUnwanted(file):
    with open(file) as f, open('out.txt', 'w') as s:
        for line in f:
            # "BGC0000001_59320bp" -> "BGC0000001"
            pos = line.split()[0].split('_')[0]
            if pos not in wanted:
                s.write(line)

rmUnwanted(files)

But this takes forever for my large files. Is there a better way to do this? Is there anything wrong with my code?

Thank you so much!

Use a set instead of a list, so each membership test is O(1) on average instead of a scan over the whole list:

wanted = { line.strip() for line in open('rmList.txt') }

... (the rest of the code stays the same)
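For clarity, here is a minimal sketch of the question's function with only that change applied (same file names and logic as in the question):

wanted = {line.strip() for line in open('rmList.txt')}  # set: O(1) average lookups

def rmUnwanted(file):
    with open(file) as f, open('out.txt', 'w') as s:
        for line in f:
            # "BGC0000001_59320bp" -> "BGC0000001"
            pos = line.split()[0].split('_')[0]
            if pos not in wanted:
                s.write(line)

rmUnwanted('coverage.txt')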

Your code is not wrong; it does what you want. But with large files it will take time, and there is room to improve its efficiency.

If you are sure that both your files are already sorted (as they appear to be in your example), this code should be faster:

def rmUnwanted(file):
    with open(file) as f, open('out.txt', 'w') as s:
        i = 0            # index of the next ID to remove
        lastwanted = ""  # the ID most recently removed
        for line in f:
            pos = line.split()[0].split('_')[0]
            try:
                if pos not in [wanted[i], lastwanted]:
                    s.write(line)
                else:
                    if pos == wanted[i]:
                        # first line with this ID: remember it and advance
                        lastwanted = wanted[i]
                        i = i + 1
            except IndexError:
                # all IDs in wanted have been consumed; keep the line
                # unless it still belongs to the last removed ID
                if pos != lastwanted:
                    s.write(line)

It gives the same result with the example files you provided, but is faster (I did not measure it, but it should be). What I do here is avoid searching the whole wanted list for pos at each iteration, which is time-consuming if your real rmList.txt is also large.
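Note that this version assumes wanted is still a list (so it can be indexed) and that it is sorted in the same order as the IDs in coverage.txt. If rmList.txt is not guaranteed to be sorted, you could load it like this:

wanted = sorted(line.strip() for line in open('rmList.txt'))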

You can do it as follows:

with open("rmLst.txt") as f:
    rmLst = set(f.readlines())

with open("out.txt", "w") as outf, open("coverage.txt") as inf:
    # write header
    outf.write(next(inf))
    # write lines that do not start with a banned ID
    outf.writelines(line for line in inf if line[:line.index("_")] not in rmList)

First, you store all the IDs to remove in a set for fast lookup. Then, iterate over the lines and check whether each line starts with a banned ID. Note that instead of calling line.split(), we can access the ID portion of each line with line[:line.index("_")]. This avoids splitting the whole line into a list of fields and should be faster than split(). If all IDs have a constant length, you can replace line.index("_") with a number.
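For example, the IDs in the sample data are all 10 characters long; assuming that also holds for the real data, the last line could become:

outf.writelines(line for line in inf if line[:10] not in rmLst)

This skips the index("_") scan on every line entirely.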
