
How to remove all lines from a large text file (>60GB) that contain a specific letter in Python?

I have a large text file (>60GB) and I want to remove certain lines from it.

The text file contains:

352_0M, 352_1M,  0.913
500_1F, 452_0M,  0.500
870_0M, 400_1F,  0.980
601_1F, 470_0M,  0.630
845_0M, 900_1M,  0.456
100_1F, 250_0F,  0.123
...

I want to remove all lines that contain the letter "F" in the first column, the second column, or both. The expected output is:

352_0M, 352_1M,  0.913
845_0M, 900_1M,  0.456

How can I do this in Python?

# Stream the file line by line, so memory use stays constant
# no matter how large the input is.
with open('input_file', 'r') as inf, open('output_file', 'w') as outf:
    for line in inf:
        # split on the first two commas and check only the first two columns
        if not any('F' in col for col in line.split(',', 2)[:2]):
            outf.write(line)

A solution with numpy:

import numpy as np

# Note: loadtxt reads the whole file into memory, so this is only
# practical when the file fits in RAM.
A = np.loadtxt('input_file', dtype=str, delimiter=', ')
id1 = np.array(['F' not in a for a in A.T[0]])
id2 = np.array(['F' not in a for a in A.T[1]])
B = A[id1 & id2]
# savetxt needs an explicit string format for a string array
np.savetxt('file_out', B, fmt='%s', delimiter=', ')

EDIT: thanks to the comments from Marcos and AMC, I stand corrected: I thought my proposed solution was a bit faster, but it is not. Błotosmętek's solution is much better in both speed and RAM usage. I checked with a 600 GB test file, and the proposed numpy solution is about twice as slow as Błotosmętek's.
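A rough way to reproduce that kind of comparison is to time the streaming filter on a small generated sample. Everything below (file names, sample size, the `stream_filter` helper) is an illustrative assumption, not the original benchmark:

```python
import os
import tempfile
import time

def stream_filter(in_path, out_path):
    # line-by-line streaming filter, as in Błotosmętek's answer
    with open(in_path) as inf, open(out_path, 'w') as outf:
        for line in inf:
            if not any('F' in x for x in line.split(',', 2)[:2]):
                outf.write(line)

# generate a small sample file in the expected format
d = tempfile.mkdtemp()
src = os.path.join(d, 'sample.txt')
with open(src, 'w') as f:
    for i in range(100_000):
        sex = 'F' if i % 3 == 0 else 'M'
        f.write(f"{i}_0{sex}, {i}_1M,  0.500\n")

start = time.perf_counter()
stream_filter(src, os.path.join(d, 'filtered.txt'))
print(f"streaming filter took {time.perf_counter() - start:.3f}s")
```

Scaling the sample up (and timing the numpy version the same way) gives a fair side-by-side on your own hardware.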

Try splitting the file into multiple parts with Python, then filter each part for the specific pattern. Processing very large files in one go is difficult because loading them whole requires a huge amount of RAM.
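A minimal sketch of that chunked idea, assuming the same comma-separated format as in the question (the function name, paths, and chunk size are all illustrative): filtered lines are buffered and flushed in batches, so memory stays bounded by the chunk size rather than the file size.

```python
def filter_lines(in_path, out_path, chunk_lines=1_000_000):
    # Filter a large file in bounded memory: buffer kept lines
    # and write them out once per chunk of `chunk_lines` lines.
    with open(in_path) as inf, open(out_path, 'w') as outf:
        buffer = []
        for line in inf:
            # keep the line only if neither of the first two columns contains 'F'
            if not any('F' in col for col in line.split(',', 2)[:2]):
                buffer.append(line)
            if len(buffer) >= chunk_lines:
                outf.writelines(buffer)
                buffer.clear()
        outf.writelines(buffer)  # flush the final partial chunk
```

In practice plain line-by-line iteration (as in the accepted answer) already keeps memory constant, so explicit chunking mainly helps if you want to parallelize the parts.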
