
How to process a very large file in Python?

I am processing a very large file containing raw text in Python. The content of the file has the following format:

  • each record is separated by point_separator
  • each field is separated by field_separator.
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator

Previously I was reading the complete file using Python's file open API:

with open(file_name, mode='rt', encoding='utf-8') as reader:
    text = reader.read()
    objects = text.strip().split('point_separator')
    ......

However, this fails with a MemoryError when the file is very large, e.g. 20 GB, while the machine I am processing on has only 16 GB of RAM.

The problem is that I cannot simply process this file line by line, because I have to collect all the fields based on field_separator until I see a point_separator.

Is there a way to make the OS use paging so that this is handled transparently?

You could write your own generator function that lets you iterate over the file one record at a time, without ever reading the whole file into memory at once. For example:

def myiter(filename, point_separator):
    with open(filename, mode='rt', encoding='utf-8') as reader:
        text = ''
        while True:
            line = reader.readline()
            if not line:
                break
            if line.strip() == point_separator:
                yield text
                text = ''
            else:
                text += line
    if text:
        yield text


# put in the actual separator values here - tested the version shown 
# in the question using literal "point_separator" and "field_separator" 
point_separator = 'point_separator' 
field_separator = 'field_separator' 
filename = 'test.txt'

for record in myiter(filename, point_separator):
    fields = record.split(field_separator + '\n')
    print(fields)

With the example data in the question, this gives:

['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']

You can then strip the newlines as required (I haven't done that here, since I don't know whether the fields may span multiple lines).

Also, I haven't done anything special with the "new record" and "call" markers. You could use print(fields[1:-1]) to exclude them.
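As a sketch of both steps together (assuming, per the sample output above, that every record begins with "new record", ends with "call", and has single-line fields):

```python
# Hypothetical record, shaped like the split output shown above
fields = ['new record\n', 'another field\n', 'another field\n', 'call\n']

# Drop the leading "new record" and trailing "call" entries,
# then strip the newline from each remaining field
cleaned = [f.strip() for f in fields[1:-1]]
print(cleaned)  # → ['another field', 'another field']
```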

You could use itertools.groupby to iterate first over everything between "new record" lines, and then, within each record, over everything between "field_separator" lines. In the outer groupby, new_record is true for the lines holding the text "new record"; once new_record goes false, you know you are inside a record and can run the inner groupby.

import itertools

def record_sep(line):
    return line.strip() == "new record"
    
def field_sep(line):
    return line.strip() == "field_separator"
    
records = []

with open('thefile') as fileobj:
    for new_record, record_iter in itertools.groupby(fileobj, record_sep):
        # skip new record group and proceed to field input
        if not new_record:
            record = []
            for new_field, field_iter in itertools.groupby(record_iter, field_sep):
                 # skip field separator group and proceed to value group
                if not new_field:
                    record.append(list(field_iter)) # assuming multiple values in field
            records.append(record)


for record in records:
    print(record)
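To see why the not new_record / not new_field tests work, note that groupby simply alternates between separator groups (key True) and content groups (key False). A minimal sketch on hypothetical input lines:

```python
import itertools

lines = ['new record\n', 'field_separator\n', 'a\n', 'field_separator\n', 'b\n']

# groupby alternates between separator groups (key True)
# and content groups (key False)
keys = [k for k, _ in itertools.groupby(lines,
                                        lambda l: l.strip() == 'field_separator')]
print(keys)  # → [False, True, False, True, False]
```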

Process line by line:

objects = []
fields = []
field = ''
with open(file_name, mode='rt', encoding='utf-8') as reader:
    for line in reader:
        line = line.strip()
        if 'point_separator' == line:
            # flush the last field (e.g. "call") before closing the record
            fields.append(field)
            field = ''
            objects.append(fields)
            fields = []
        elif 'field_separator' == line:
            fields.append(field)
            field = ''
        else:
            field += line + '\n'
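As for the original question about letting the OS handle paging: mmap does exactly that — the file is mapped into virtual memory and paged in on demand, so it can be scanned without ever being fully resident. A sketch, assuming the literal separator strings from the question (note that mmap works on bytes, so the separators are byte strings):

```python
import mmap

def iter_records(path, point_separator=b'point_separator\n'):
    """Yield one raw record (bytes) at a time from a memory-mapped file."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while True:
                end = mm.find(point_separator, start)
                if end == -1:
                    tail = mm[start:]
                    if tail.strip():
                        yield tail  # file did not end with a separator
                    return
                yield mm[start:end]
                start = end + len(point_separator)

# Each record can then be split into fields as before:
# fields = record.split(b'field_separator\n')
```

Slicing the map copies only one record at a time into RAM, and access=mmap.ACCESS_READ keeps the mapping read-only.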
