In Python (SageMath 9.0) - text file on 1B lines - optimal way to read from a specific line
I'm running SageMath 9.0, on Windows 10 OS.
I've read several similar questions (and answers) on this site, mainly this one on reading from the 7th line, and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very far away) line, and whether I should read line by line, or whether reading by block could be "more optimal" in my case.
I have a 12 GB text file, made of around 1 billion short lines, all consisting of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:
J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...
For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded in graph6 format. The file has been computed and made available by Brendan McKay on his webpage here.
I need to check every graph for some properties. I could use the generator

for G in graphs(11)

but this can be very long (a few days at least on my laptop). I want to use the complete database in the file, so that I'm able to stop and start again from a certain point.
My current code reads the file line by line from the start, and does some computation after reading each line:
with open(filename, 'r') as file:
    while True:
        # Get next line from file
        line = file.readline()
        # if line is empty, end of file is reached
        if not line:
            print("End of Database Reached")
            break
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
In order to be able to stop the code, or save the progress in case of a crash, I was thinking of the following: instead of

line = file.readline()

I would use the itertools option

for line in islice(file, start_line, None)

so that my new code is:
from itertools import islice

start_line = load('foo')
count = start_line
save_every_n_lines = 1000000

with open(filename, 'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G, line.strip())
        run_some_code(G)
        count += 1
        if (count % save_every_n_lines) == 0:
            save(count, 'foo')
The code does work, but I would like to understand if I can optimise it. I'm not a big fan of the

if

statement in my

for

loop. Is

itertools.islice()

the good option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached", so I'm afraid it still reads every line up to start_line. Each line has a constant number of characters, so "jumping" directly to the starting line might be feasible.
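To make the trade-off concrete, here is a minimal sketch (the file name sample.txt, the 12-byte line width, and start_line are all illustrative, assuming 1-byte characters and "\n" line endings) comparing islice, which skips lines by reading them one at a time, with seek(), which jumps straight to a byte offset thanks to the fixed line width:

```python
from itertools import islice

# Build a small fixed-width test file: each line is 11 chars + newline = 12 bytes.
line_len = 12
with open("sample.txt", "w") as f:
    for i in range(100):
        f.write("g{:010d}\n".format(i))

start_line = 40

# islice reaches line 40 by reading and discarding the 40 lines before it...
with open("sample.txt") as f:
    first_islice = next(islice(f, start_line, None))

# ...whereas seek() jumps directly to byte offset start_line * line_len.
with open("sample.txt") as f:
    f.seek(start_line * line_len)
    first_seek = f.readline()

print(first_islice == first_seek)  # both approaches yield the same line
```

On a 1-billion-line file the islice version still has to consume every line before start_line, while the seek version is a single O(1) jump.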
Assuming each line is the same size, you can use a memory-mapped file and read it by index without mucking about with seek and tell. The memory-mapped file emulates a

bytearray

and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index, and you can start up again with that index later.
This example is on Linux - mmap open on Windows is a bit different - but after its setup, access should be the same.
import mmap

# each record is 11 characters plus the newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1

# generate a test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ]
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 20", get_record(data, 20))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ  # integer division: range() needs ints
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data, 6, 9)])

del data
f.close()
If the length of the line is constant (in this case it's 12: 11 characters plus the newline character), you might do
def get_line(k, line_len):
    with open('file') as f:
        f.seek(k * line_len)
        return next(f)
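The same seek trick extends naturally to resuming iteration from line k onward, which is what the question needs; a sketch (the file name demo.txt and the 12-byte line width are illustrative):

```python
def lines_from(path, start_line, line_len):
    # Jump straight to the start line, then iterate the file normally.
    with open(path) as f:
        f.seek(start_line * line_len)
        for line in f:
            yield line

# demo on a small fixed-width file (11 chars + newline per line)
with open("demo.txt", "w") as f:
    for i in range(10):
        f.write("g{:010d}\n".format(i))

rest = list(lines_from("demo.txt", 7, 12))
print(rest)  # lines 7, 8, and 9
```

Each yielded line can then be fed to from_graph6 exactly as in the original loop, with no lines read before start_line.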