
How to process a very large file in Python?

I am processing a very large raw-text file in Python. The content of the file has the following format:

  • each record is separated by point_separator
  • each field is separated by field_separator.
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator

Previously I was reading the complete file into memory using Python's file API:

with open(file_name, mode='rt', encoding='utf-8') as reader:
    text = reader.read()
    objects = text.strip().split('point_separator')
    ......

However, this fails with a MemoryError when the file is very large (e.g. 20 GB) and the machine has only 16 GB of RAM.

The problem is that I cannot simply process this file line by line, as I have to collect all the fields (delimited by field_separator) until I see a point_separator.

Is there a way to make the OS use paging so that this is handled transparently?

You could write your own generator function that allows you to iterate over the file a record at a time, without ever reading the whole file into memory simultaneously. For example:

def myiter(filename, point_separator):
    with open(filename, mode='rt', encoding='utf-8') as reader:
        lines = []
        for line in reader:
            if line.strip() == point_separator:
                yield ''.join(lines)   # one complete record
                lines = []
            else:
                lines.append(line)
        if lines:                      # trailing record with no final separator
            yield ''.join(lines)


# put in the actual separator values here - tested the version shown 
# in the question using literal "point_separator" and "field_separator" 
point_separator = 'point_separator' 
field_separator = 'field_separator' 
filename = 'test.txt'

for record in myiter(filename, point_separator):
    fields = record.split(field_separator + '\n')
    print(fields)

With the example version in the question, this gives:

['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']

You can then strip the newlines as you require (I haven't done this, as I don't know whether the fields may span multiple lines).

Also, I haven't done anything special with the "new record" and "call" markers. You could use fields[1:-1] to exclude them.
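For instance, if the fields do turn out to be single-line, a small helper (clean_fields is a hypothetical name, not part of the answer above) could strip the newlines and drop both markers in one go:

```python
def clean_fields(record, field_separator='field_separator'):
    # Split one record into fields, strip surrounding whitespace, and
    # drop the leading "new record" and trailing "call" markers.
    fields = [f.strip() for f in record.split(field_separator + '\n')]
    return fields[1:-1]
```

Applied to each record yielded by the generator, this would return only the "another field" values.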

You could use itertools.groupby to iterate first over everything between "new record" lines, and then, within each record, over everything between "field_separator" lines. In the outer groupby, new_record is true for the lines holding the separator text itself; once new_record goes false, you know you are inside a record and can run the inner groupby.

import itertools

def record_sep(line):
    # Treat both marker lines as record boundaries; otherwise the
    # "point_separator" line would leak into the last field of each record.
    return line.strip() in ("new record", "point_separator")
    
def field_sep(line):
    return line.strip() == "field_separator"
    
records = []

with open('thefile') as fileobj:
    for new_record, record_iter in itertools.groupby(fileobj, record_sep):
        # skip new record group and proceed to field input
        if not new_record:
            record = []
            for new_field, field_iter in itertools.groupby(record_iter, field_sep):
                # skip field separator group and proceed to value group
                if not new_field:
                    record.append(list(field_iter)) # assuming multiple values in field
            records.append(record)


for record in records:
    print(record)

Process the file line by line:

objects = []
fields = []
field = ''
with open(file_name, mode='rt', encoding='utf-8') as reader:
    for line in reader:
        line = line.strip()
        if 'point_separator' == line:
            fields.append(field)   # flush the last field of the record
            field = ''
            objects.append(fields)
            fields = []
        elif 'field_separator' == line:
            fields.append(field)
            field = ''
        else:
            field += line + '\n'
if field or fields:                # flush a record not ended by point_separator
    fields.append(field)
    objects.append(fields)
