
How to remove all lines from a large text file (>60GB) that contain a specific letter in Python?

I have a large text file (>60GB) and I want to remove certain lines from it.

The text file contains:

352_0M, 352_1M,  0.913
500_1F, 452_0M,  0.500
870_0M, 400_1F,  0.980
601_1F, 470_0M,  0.630
845_0M, 900_1M,  0.456
100_1F, 250_0F,  0.123
...

I want to remove all lines that contain the letter "F" in the first column, the second column, or both. The expected output is:

352_0M, 352_1M,  0.913
845_0M, 900_1M,  0.456

How can I do this in Python?

# Stream the file line by line, so memory use stays constant
# no matter how large the input is.
with open('input_file', 'r') as inf, open('output_file', 'w') as outf:
    for line in inf:
        # split on the first two commas and check only the first two columns
        if not any('F' in col for col in line.split(',', 2)[:2]):
            outf.write(line)

A solution with numpy:

import numpy as np

# Note: loadtxt reads the whole file into memory, so this is only
# practical when the file fits in RAM.
A = np.loadtxt('input_file', dtype=str, delimiter=', ')
id1 = np.array(['F' not in a for a in A.T[0]])
id2 = np.array(['F' not in a for a in A.T[1]])
B = A[id1 & id2]
# savetxt needs an explicit string format for a string array
np.savetxt('file_out', B, fmt='%s', delimiter=', ')

EDIT: thanks to the comments from Marcos and AMC, I stand corrected: I thought my proposed solution was a bit faster, but it is not. Błotosmętek's solution is much better in both speed and RAM usage. I checked with a 600 GB test file, and the proposed numpy solution is about twice as slow as Błotosmętek's.
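A rough way to reproduce that kind of comparison is to time the streaming filter on a small generated sample. Everything below (file names, sample size, the `stream_filter` helper) is an illustrative assumption, not the original benchmark:

```python
import os
import tempfile
import time

def stream_filter(in_path, out_path):
    # line-by-line streaming filter, as in Błotosmętek's answer
    with open(in_path) as inf, open(out_path, 'w') as outf:
        for line in inf:
            if not any('F' in x for x in line.split(',', 2)[:2]):
                outf.write(line)

# generate a small sample file in the expected format
d = tempfile.mkdtemp()
src = os.path.join(d, 'sample.txt')
with open(src, 'w') as f:
    for i in range(100_000):
        sex = 'F' if i % 3 == 0 else 'M'
        f.write(f"{i}_0{sex}, {i}_1M,  0.500\n")

start = time.perf_counter()
stream_filter(src, os.path.join(d, 'filtered.txt'))
print(f"streaming filter took {time.perf_counter() - start:.3f}s")
```

Scaling the sample up (and timing the numpy version the same way) gives a fair side-by-side on your own hardware.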

Try splitting the file into multiple parts with Python, then filter each part for the specific pattern. Processing very large files in one go is difficult because loading them whole requires a huge amount of RAM.
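A minimal sketch of that chunked idea, assuming the same comma-separated format as in the question (the function name, paths, and chunk size are all illustrative): filtered lines are buffered and flushed in batches, so memory stays bounded by the chunk size rather than the file size.

```python
def filter_lines(in_path, out_path, chunk_lines=1_000_000):
    # Filter a large file in bounded memory: buffer kept lines
    # and write them out once per chunk of `chunk_lines` lines.
    with open(in_path) as inf, open(out_path, 'w') as outf:
        buffer = []
        for line in inf:
            # keep the line only if neither of the first two columns contains 'F'
            if not any('F' in col for col in line.split(',', 2)[:2]):
                buffer.append(line)
            if len(buffer) >= chunk_lines:
                outf.writelines(buffer)
                buffer.clear()
        outf.writelines(buffer)  # flush the final partial chunk
```

In practice plain line-by-line iteration (as in the accepted answer) already keeps memory constant, so explicit chunking mainly helps if you want to parallelize the parts.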
