
How to read data corresponding to specific line numbers from a 60GB text file in Python?

I have a 60GB text file (1 billion lines). I have to extract the data corresponding to specific line numbers, which are read from another text file (e.g. 1, 4, 70, 100, etc.). Because of its size, I can't load the file into memory and then extract the lines. Also, matching and extracting line by line would take many days. Is there any solution to this problem?

Two methods I tried:

f = open('line_numbers.txt')
lines = f.readlines()
numbers =[int(e.strip()) for e in lines]
r = max(numbers)
file = open('OUTPUT_RESULT.txt','w') 
with open('Large_File.txt') as infile:
        for num, line in enumerate(infile,1):
                if (num<= r):
                        if (num in numbers):
                                file.write(line)
                        else:
                                pass
                        print(num)

This would take many days to produce the result.

import pandas as pd
data = pd.read_csv('Large_File.txt', header=None)
file = open('OUTPUT_RESULT.txt','w') 

f = open('line_numbers.txt')
lines = f.readlines()
numbers =[int(e.strip()) for e in lines]

x = data.loc[numbers,:]
file.write(x)

This fails because the file cannot be loaded into memory.

Is there any solution available to resolve this?

Your issue is probably the if (num in numbers) line. The parentheses are unnecessary, but more importantly numbers is a list, so this test searches the list on every iteration, even though your code goes through the file in order (first line 1, then line 2, etc.) and only ever needs to compare against the next wanted line number.

That is easy to optimise. With the change, the code below ran in only 12 seconds on a test file of about 50 million lines, so it should process your file in a few minutes.
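To illustrate the cost of that membership test, here is a minimal sketch, assuming the wanted line numbers fit in memory as a set. The function name `extract_lines_with_set` and its parameters are hypothetical and not from the question. A set lookup takes roughly constant time on average, whereas `num in numbers` on a list may scan the whole list for every line of the large file.

```python
def extract_lines_with_set(lines, wanted_numbers):
    """Yield the lines whose 1-based position is in wanted_numbers.

    `lines` is any iterable of lines, for example an open file object.
    """
    wanted = set(wanted_numbers)   # set lookup: constant time on average
    if not wanted:
        return
    last = max(wanted)             # nothing left to find past this line
    for num, line in enumerate(lines, 1):
        if num > last:
            break                  # stop early instead of reading the rest
        if num in wanted:
            yield line


# Small in-memory demonstration (hypothetical data)
sample = ["a\n", "b\n", "c\n", "d\n", "e\n"]
result = list(extract_lines_with_set(sample, [4, 1, 4]))
```

With a real file, the first argument would be the open file object, so only one line is held in memory at a time.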

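import random

# 1,000 random line numbers for testing; the set drops duplicates,
# which would otherwise stall the matching below
numbers = sorted({random.randint(1, 50000000) for _ in range(1000)})
wanted = iter(numbers)
target = next(wanted, None)  # next line number to extract; None when done
with open('archive_list.txt', 'r', encoding='cp437') as infile, \
        open('specific_lines.txt', 'w') as outfile:
    for num, line in enumerate(infile, 1):
        if target is None:
            break  # all requested lines written; skip the rest of the file
        if num == target:
            outfile.write(line)
            print(num)
            target = next(wanted, None)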

Note: this generates 1,000 random line numbers; replace them with your loaded numbers, as in your example. If your list of numbers is far larger, the time to write the output file will increase execution time somewhat.

With your line numbers file, the code would look like this:

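with open('line_numbers.txt') as f:
    # Skip blank lines; the set drops duplicate line numbers,
    # which would otherwise stall the matching below
    numbers = sorted({int(e) for e in f if e.strip()})
wanted = iter(numbers)
target = next(wanted, None)  # next line number to extract; None when done
# Replace the file name and encoding with those of your large file
with open('archive_list.txt', 'r', encoding='cp437') as infile, \
        open('specific_lines.txt', 'w') as outfile:
    for num, line in enumerate(infile, 1):
        if target is None:
            break  # all requested lines written; skip the rest of the file
        if num == target:
            outfile.write(line)
            print(num)
            target = next(wanted, None)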
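For reuse, the same single-pass idea could be wrapped in a function. The sketch below is one possible variant, not the answer's exact code: the name `extract_lines` and its parameters are hypothetical, it reads the file in binary mode so that no encoding has to be guessed, and it assumes every line number is 1 or greater.

```python
import os
import tempfile


def extract_lines(src_path, numbers_path, dst_path):
    """Copy the lines of src_path whose 1-based numbers are listed in numbers_path.

    Returns the number of lines written. Assumes all line numbers are >= 1.
    """
    with open(numbers_path) as f:
        # Skip blank lines; the set drops duplicate line numbers
        wanted = sorted({int(s) for s in f if s.strip()})
    targets = iter(wanted)
    target = next(targets, None)   # next line number to extract
    written = 0
    # Binary mode: lines are copied as raw bytes, without decoding
    with open(src_path, "rb") as infile, open(dst_path, "wb") as outfile:
        for num, line in enumerate(infile, 1):
            if target is None:
                break              # every requested line was found
            if num == target:
                outfile.write(line)
                written += 1
                target = next(targets, None)
    return written


# Demonstration on a small temporary file (hypothetical data)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "big.txt")
    nums = os.path.join(tmp, "line_numbers.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "w") as f:
        f.writelines(f"row {i}\n" for i in range(1, 11))
    with open(nums, "w") as f:
        f.write("7\n2\n7\n\n")     # unsorted, with a duplicate and a blank line
    count = extract_lines(src, nums, dst)
    with open(dst) as f:
        extracted = f.read()
```

The lines come out in file order rather than in the order listed in the numbers file, which matches the behaviour of the sorted approach above.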


 