
How to process a very large file in Python?

I am processing a very large file containing raw text in Python. The content of the file has the following format:

  • each record is separated by point_separator
  • each field is separated by field_separator.
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator

Previously I was reading the complete file using Python's file open API:

with open(file_name, mode='rt', encoding='utf-8') as reader:
    text = reader.read()
    objects = text.strip().split('point_separator')
    ......

However, this fails with a MemoryError when the file is very large, e.g. 20 GB, while the machine I am processing on has only 16 GB of RAM.

The problem is that I cannot simply process this file line by line, because I have to collect all the fields based on field_separator until I see a point_separator.

Is there a way to make the OS use paging so that this is handled transparently?

You could write your own generator function that lets you iterate over the file one record at a time, without ever reading the whole file into memory at once. For example:

def myiter(filename, point_separator):
    with open(filename, mode='rt', encoding='utf-8') as reader:
        text = ''
        while True:
            line = reader.readline()
            if not line:
                break
            if line.strip() == point_separator:
                yield text
                text = ''
            else:
                text += line
    if text:
        yield text


# put in the actual separator values here - tested the version shown 
# in the question using literal "point_separator" and "field_separator" 
point_separator = 'point_separator' 
field_separator = 'field_separator' 
filename = 'test.txt'

for record in myiter(filename, point_separator):
    fields = record.split(field_separator + '\n')
    print(fields)

With the example data in the question, this gives:

['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']

You can then strip the newlines as required (I haven't done that here, since I don't know whether the fields may span multiple lines).

Also, I haven't done anything special with the "new record" and "call" markers. You could use print(fields[1:-1]) to exclude them.
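As a sketch of both steps together (assuming, per the sample output above, that every record begins with "new record", ends with "call", and has single-line fields):

```python
# Hypothetical record, shaped like the split output shown above
fields = ['new record\n', 'another field\n', 'another field\n', 'call\n']

# Drop the leading "new record" and trailing "call" entries,
# then strip the newline from each remaining field
cleaned = [f.strip() for f in fields[1:-1]]
print(cleaned)  # → ['another field', 'another field']
```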

You could use itertools.groupby to iterate first over everything between "new record" lines, and then, within each record, over everything between "field_separator" lines. In the outer groupby, new_record is true for the lines holding the text "new record"; once new_record goes false, you know you are inside a record and can run the inner groupby.

import itertools

def record_sep(line):
    return line.strip() == "new record"
    
def field_sep(line):
    return line.strip() == "field_separator"
    
records = []

with open('thefile') as fileobj:
    for new_record, record_iter in itertools.groupby(fileobj, record_sep):
        # skip new record group and proceed to field input
        if not new_record:
            record = []
            for new_field, field_iter in itertools.groupby(record_iter, field_sep):
                 # skip field separator group and proceed to value group
                if not new_field:
                    record.append(list(field_iter)) # assuming multiple values in field
            records.append(record)


for record in records:
    print(record)
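To see why the not new_record / not new_field tests work, note that groupby simply alternates between separator groups (key True) and content groups (key False). A minimal sketch on hypothetical input lines:

```python
import itertools

lines = ['new record\n', 'field_separator\n', 'a\n', 'field_separator\n', 'b\n']

# groupby alternates between separator groups (key True)
# and content groups (key False)
keys = [k for k, _ in itertools.groupby(lines,
                                        lambda l: l.strip() == 'field_separator')]
print(keys)  # → [False, True, False, True, False]
```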

Process line by line:

objects = []
fields = []
field = ''
with open(file_name, mode='rt', encoding='utf-8') as reader:
    for line in reader:
        line = line.strip()
        if 'point_separator' == line:
            # flush the last field (e.g. "call") before closing the record
            fields.append(field)
            field = ''
            objects.append(fields)
            fields = []
        elif 'field_separator' == line:
            fields.append(field)
            field = ''
        else:
            field += line + '\n'
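As for the original question about letting the OS handle paging: mmap does exactly that — the file is mapped into virtual memory and paged in on demand, so it can be scanned without ever being fully resident. A sketch, assuming the literal separator strings from the question (note that mmap works on bytes, so the separators are byte strings):

```python
import mmap

def iter_records(path, point_separator=b'point_separator\n'):
    """Yield one raw record (bytes) at a time from a memory-mapped file."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while True:
                end = mm.find(point_separator, start)
                if end == -1:
                    tail = mm[start:]
                    if tail.strip():
                        yield tail  # file did not end with a separator
                    return
                yield mm[start:end]
                start = end + len(point_separator)

# Each record can then be split into fields as before:
# fields = record.split(b'field_separator\n')
```

Slicing the map copies only one record at a time into RAM, and access=mmap.ACCESS_READ keeps the mapping read-only.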
