How to process a very large file in python?
I am processing a very large file containing raw text in python. The content of the file has the following format:
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
Previously I was reading the complete file using the python file open API call:
with open(file_name, mode='rt', encoding='utf-8') as reader:
    text = reader.read()
    objects = text.strip().split('point_separator')
    ......
However, this fails with MemoryError when the file is very large, e.g. 20GB, and I am processing it on a machine with 16GB of RAM.
The problem is I cannot read this file line by line, as I have to collect all the fields based on field_separator until I see a point_separator.
Is there a way so that the OS will use paging and it would be handled transparently?
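For what it's worth, the standard-library mmap module gives roughly that behavior: the file is mapped into virtual memory and the OS pages it in on demand, rather than Python holding all 20GB at once. A minimal sketch, with a hypothetical iter_records helper and the separator layout assumed from the example above:

```python
import mmap
import tempfile

def iter_records(path, sep=b'point_separator\n'):
    """Yield one record at a time from the file at `path`.

    The file is mapped into virtual memory, so the OS pages it in on
    demand instead of the whole file being read into RAM up front.
    """
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while start < len(mm):
                idx = mm.find(sep, start)
                end = idx if idx != -1 else len(mm)
                chunk = mm[start:end]
                if chunk.strip():  # skip empty chunks between adjacent separators
                    yield chunk.decode('utf-8')
                if idx == -1:
                    break
                start = idx + len(sep)

# tiny demo with a temporary file mimicking the format in the question
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tf:
    tf.write('point_separator\nnew record\nfield_separator\nfoo\nfield_separator\ncall\n')
    demo_path = tf.name
for record in iter_records(demo_path):
    print(record.split('field_separator\n'))
```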
You could write your own generator function that allows you to iterate over the file one record at a time, without ever reading the whole file into memory at once. For example:
def myiter(filename, point_separator):
    with open(filename, mode='rt', encoding='utf-8') as reader:
        text = ''
        while True:
            line = reader.readline()
            if not line:
                break
            if line.strip() == point_separator:
                yield text
                text = ''
            else:
                text += line
        if text:
            yield text

# put in the actual separator values here - tested the version shown
# in the question using literal "point_separator" and "field_separator"
point_separator = 'point_separator'
field_separator = 'field_separator'
filename = 'test.txt'

for record in myiter(filename, point_separator):
    fields = record.split(field_separator + '\n')
    print(fields)
With the example version in the question, this gives:
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
You can then strip the newlines as you require (I haven't done this for you already, as I don't know if the fields may be multi-line).
Also, I haven't done anything special with the "new record" and "call". You could do
print(fields[1:-1])
to exclude these.
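Putting those two cleanups together, a record could be post-processed like this (a sketch with a hypothetical clean_fields helper and a made-up sample record in the question's format):

```python
def clean_fields(record, field_separator='field_separator'):
    """Split a record on its separator lines, drop the leading "new record"
    and trailing "call" markers, and strip the trailing newlines."""
    fields = record.split(field_separator + '\n')
    return [f.rstrip('\n') for f in fields[1:-1]]

# hypothetical sample record in the question's format
record = 'new record\nfield_separator\nfoo\nfield_separator\nbar\nfield_separator\ncall\n'
print(clean_fields(record))  # → ['foo', 'bar']
```

Note this assumes fields are single-line; if they can span lines, keep the embedded newlines and only rstrip the final one as shown.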
You could use itertools.groupby to iterate first by everything between "new record" lines and then internally by everything between "field_separator" lines. In the outer groupby, new_record will be true for all lines holding the text "new record"; once new_record first goes false, you know you are inside the record and can do the inner groupby.
import itertools

def record_sep(line):
    return line.strip() == "new record"

def field_sep(line):
    return line.strip() == "field_separator"

records = []
with open('thefile') as fileobj:
    for new_record, record_iter in itertools.groupby(fileobj, record_sep):
        # skip new record group and proceed to field input
        if not new_record:
            record = []
            for new_field, field_iter in itertools.groupby(record_iter, field_sep):
                # skip field separator group and proceed to value group
                if not new_field:
                    record.append(list(field_iter))  # assuming multiple values in field
            records.append(record)

for record in records:
    print(record)
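If groupby's behavior is unfamiliar, here is a tiny self-contained demo of the outer grouping step, using a short in-memory list in place of the file object (the line values are made up for illustration):

```python
import itertools

def record_sep(line):
    return line.strip() == "new record"

# five lines standing in for the file's contents
lines = ['new record\n', 'a\n', 'b\n', 'new record\n', 'c\n']

# groupby pairs each key with an iterator over the consecutive
# lines that produced that key
for is_sep, group in itertools.groupby(lines, record_sep):
    print(is_sep, list(group))
```

Each "new record" line forms a True group on its own, and the lines between separators come out as one False group, which is why the answer above only processes the `not new_record` groups.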
process line by line
objects = []
fields = []
field = ''
with open(file_name, mode='rt', encoding='utf-8') as reader:
    for line in reader:
        line = line.strip()
        if 'point_separator' == line:
            if field:  # flush the field collected since the last separator
                fields.append(field)
                field = ''
            if fields:  # skip the empty list produced by a leading separator
                objects.append(fields)
            fields = []
        elif 'field_separator' == line:
            fields.append(field)
            field = ''
        else:
            field += line + '\n'
# flush the trailing record if the file does not end with point_separator
if field:
    fields.append(field)
if fields:
    objects.append(fields)
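To sanity-check this kind of streaming parser without a 20GB file on disk, the same logic can be wrapped in a function that accepts any iterable of lines and fed an io.StringIO (parse_lines is a hypothetical wrapper, not from the answer above; it also flushes the final field so the trailing "call" is not lost):

```python
import io

def parse_lines(lines):
    """Parse an iterable of lines into a list of records (lists of fields)."""
    objects, fields, field = [], [], ''
    for line in lines:
        line = line.strip()
        if line == 'point_separator':
            if field:
                fields.append(field)
                field = ''
            if fields:
                objects.append(fields)
            fields = []
        elif line == 'field_separator':
            fields.append(field)
            field = ''
        else:
            field += line + '\n'
    if field:  # flush a trailing record with no final separator
        fields.append(field)
    if fields:
        objects.append(fields)
    return objects

# in-memory stand-in for the real file
sample = io.StringIO(
    'point_separator\nnew record\nfield_separator\nfoo\nfield_separator\ncall\npoint_separator\n'
)
print(parse_lines(sample))  # → [['new record\n', 'foo\n', 'call\n']]
```

Because io.StringIO iterates line by line exactly like a file object, the same function works unchanged on `open(file_name, mode='rt', encoding='utf-8')`.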