I am processing a very large raw-text file in Python. The content of the file has the following format:
point_separator
field_separator
.new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
new record
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
another field
field_separator
call
point_separator
Previously I was reading the complete file using the python file open API call:
with open(file_name, mode='rt', encoding='utf-8') as reader:
    text = reader.read()
objects = text.strip().split('point_separator')
......
However, this fails with MemoryError when the file is very large, e.g. 20 GB, and I am processing it on a machine with 16 GB of RAM.
The problem is that I cannot simply read this file line by line, as I have to collect all the fields based on field_separator until I see a point_separator.
Is there a way to have the OS use paging so that this is handled transparently?
You could write your own generator function that allows you to iterate over the file a record at a time, without ever reading the whole file into memory simultaneously. For example:
def myiter(filename, point_separator):
    with open(filename, mode='rt', encoding='utf-8') as reader:
        text = ''
        while True:
            line = reader.readline()
            if not line:
                break
            if line.strip() == point_separator:
                yield text
                text = ''
            else:
                text += line
        if text:
            yield text

# put in the actual separator values here - tested the version shown
# in the question using literal "point_separator" and "field_separator"
point_separator = 'point_separator'
field_separator = 'field_separator'
filename = 'test.txt'

for record in myiter(filename, point_separator):
    fields = record.split(field_separator + '\n')
    print(fields)
With the example version in the question, this gives:
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
['new record\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'another field\n', 'call\n']
You can then strip the newlines as you require (I haven't done this for you already, as I don't know if the fields may be multi-line.)
Also, I haven't done anything special with the "new record" and "call" fields. You could do print(fields[1:-1]) to exclude these.
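If the fields turn out to be single-line, the clean-up described above could look roughly like this. It is only a sketch: `clean_fields` is a name made up for illustration, and it assumes every record starts with the "new record" marker and ends with "call", as in the question's sample.

```python
def clean_fields(record, field_separator='field_separator'):
    """Split one record into fields, strip whitespace, drop the markers."""
    fields = [f.strip() for f in record.split(field_separator + '\n')]
    # Assumption: the first field is the "new record" marker and the
    # last one is "call", so both are dropped.
    return fields[1:-1]


record = ('new record\nfield_separator\nfoo\n'
          'field_separator\nbar\nfield_separator\ncall\n')
print(clean_fields(record))  # ['foo', 'bar']
```

If fields can span several lines, `strip()` only removes the outer whitespace and the inner newlines are kept.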
You could use itertools.groupby to iterate first over everything between "new record" lines and then, internally, over everything between "field_separator" lines. In the outer groupby, new_record will be true for all lines holding the text "new record"; when new_record first goes false, you know you are in the record and do the inner groupby.
import itertools

def record_sep(line):
    return line.strip() == "new record"

def field_sep(line):
    return line.strip() == "field_separator"

records = []
with open('thefile') as fileobj:
    for new_record, record_iter in itertools.groupby(fileobj, record_sep):
        # skip new record group and proceed to field input
        if not new_record:
            record = []
            for new_field, field_iter in itertools.groupby(record_iter, field_sep):
                # skip field separator group and proceed to value group
                if not new_field:
                    record.append(list(field_iter)) # assuming multiple values in field
            records.append(record)

for record in records:
    print(record)
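Note that the loop above collects every record in the `records` list, which for a 20GB file may itself exhaust memory. A lazy variant that yields one record at a time might look like the sketch below. The name `iter_records` is made up for illustration, and this version groups on the "point_separator" lines from the question rather than on "new record".

```python
import io
import itertools


def iter_records(lines, point_separator='point_separator',
                 field_separator='field_separator'):
    """Yield one record at a time as a list of fields (each a list of lines)."""
    def is_point_sep(line):
        return line.strip() == point_separator

    def is_field_sep(line):
        return line.strip() == field_separator

    for is_sep, record_lines in itertools.groupby(lines, is_point_sep):
        if is_sep:
            continue  # skip the point_separator lines themselves
        # The inner groupby is fully consumed here, before the outer
        # groupby advances, which groupby requires.
        yield [
            [line.rstrip('\n') for line in field_lines]
            for is_fsep, field_lines
            in itertools.groupby(record_lines, is_field_sep)
            if not is_fsep
        ]


# Small in-memory stand-in for the real file object.
sample = io.StringIO(
    'point_separator\n'
    'new record\nfield_separator\nfoo\nfield_separator\ncall\n'
    'point_separator\n'
    'new record\nfield_separator\nbar\nbaz\nfield_separator\ncall\n'
    'point_separator\n'
)
for record in iter_records(sample):
    print(record)
```

With a real file, pass the open file object instead of the `io.StringIO` stand-in; only one record is held in memory at a time.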
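Process the file line by line:
objects = []
fields = []
field = ''
with open(file_name, mode='rt', encoding='utf-8') as reader:
    for line in reader:
        line = line.strip()
        if line == 'point_separator':
            # the last field of a record has no trailing field_separator,
            # so close it here before closing the record
            if field:
                fields.append(field)
                field = ''
            if fields:
                # note: objects still accumulates every record; process the
                # record here instead if memory is tight
                objects.append(fields)
                fields = []
        elif line == 'field_separator':
            fields.append(field)
            field = ''
        else:
            field += line + '\n'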
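On the question of letting the OS page the file in transparently: a memory-mapped file is one way to get that behaviour. The following is a rough sketch under several assumptions: a 64-bit Python build (so a 20GB file fits in the address space), a non-empty file, "\n" line endings, and a separator string that never occurs inside field data. `iter_records_mmap` is a hypothetical name.

```python
import mmap
import os
import tempfile


def iter_records_mmap(filename, point_separator=b'point_separator\n'):
    """Yield each record as decoded text, scanning a memory-mapped file."""
    with open(filename, 'rb') as f:
        # Length 0 maps the whole file; the OS pages it in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            size = len(mm)
            while start < size:
                end = mm.find(point_separator, start)
                if end == -1:
                    end = size  # no more separators: take the rest
                chunk = mm[start:end]  # copies only this record
                if chunk.strip():
                    yield chunk.decode('utf-8')
                start = end + len(point_separator)


# Demo on a small temporary file standing in for the real one.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'point_separator\nnew record\nfield_separator\nfoo\n'
              b'point_separator\nbar\n')
for record in iter_records_mmap(tmp.name):
    print(repr(record))
os.remove(tmp.name)
```

Each yielded record can then be split on the field separator as in the other answers. The byte search matches the separator text wherever it appears, not only on a line of its own.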