
In Python (SageMath 9.0) - text file on 1B lines - optimal way to read from a specific line

I'm running SageMath 9.0 on Windows 10.

I've read several similar questions (and answers) on this site, mainly this one on reading from the 7th line, and this one on optimizing. But I have some specific issues: I need to understand how to optimally read from a specific (possibly very far away) line, and whether I should read line by line, or whether reading by block could be "more optimal" in my case.

I have a 12 GB text file, made of around 1 billion short lines, all consisting of ASCII printable characters. Each line has a constant number of characters. Here are the actual first 5 lines:

J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...

For context, this file is the list of all non-isomorphic graphs on 11 vertices, encoded in graph6 format. The file has been computed and made available by Brendan McKay on his webpage here.

I need to check every graph for some properties. I could use the generator for G in graphs(11), but this can take very long (a few days at least on my laptop). I want to use the complete database in the file, so that I'm able to stop and start again from a certain point.

My current code reads the file line by line from the start, and does some computation after reading each line:

with open(filename,'r') as file:
    while True: 
        # Get next line from file 
        line = file.readline() 

        # if line is empty, end of file is reached 
        if not line: 
            print("End of Database Reached")
            break  
        
        G = Graph()
        from_graph6(G,line.strip())

        run_some_code(G)

In order to be able to stop the code, or save the progress in case of a crash, I was thinking of:

  • Every million lines read (or so), save the progress in a specific file
  • When restarting the code, read the last saved value and, instead of using line = file.readline(), use the itertools option for line in islice(file, start_line, None)

so that my new code is:

from itertools import islice
start_line = load('foo')
count = start_line
save_every_n_lines = 1000000

with open(filename,'r') as file:
    for line in islice(file, start_line, None):
        G = Graph()
        from_graph6(G,line.strip())

        run_some_code(G)
        count += 1

        if (count % save_every_n_lines) == 0:
            save(count,'foo')
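Outside Sage, the save/load checkpoint above can be a tiny text file holding the last processed line count; a minimal plain-Python sketch (progress.txt is an assumed path, and the temp-file-then-rename step is one way to keep a crash mid-write from corrupting the checkpoint):

```python
import os

PROGRESS_FILE = "progress.txt"  # assumed checkpoint path

def load_progress():
    """Return the last saved line count, or 0 on a first run."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return int(f.read().strip())
    return 0

def save_progress(count):
    """Write the checkpoint via a temp file plus atomic rename, so a
    crash mid-write never leaves a half-written checkpoint behind."""
    tmp = PROGRESS_FILE + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(count))
    os.replace(tmp, PROGRESS_FILE)
```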

The code does work, but I would like to understand whether I can optimise it. I'm not a big fan of the if statement inside my for loop.

  • Is itertools.islice() the good option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". As "start" could be quite large, and given that I'm working on a simple text file, could there be a faster option, in order to "jump" directly to the start line?
  • Knowing that the text file is fixed, could it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
  • I also have the option to read blocks of lines in one go instead of line by line, and then work on a list of graphs. Could that be a good option?

Each line has a constant number of characters. So "jumping" might be feasible.

Assuming each line is the same size, you can use a memory-mapped file and read it by index without mucking about with seek and tell. The memory-mapped file emulates a bytearray, and you can take record-sized slices from the array for the data you want. If you want to pause processing, you only have to save the current record index in the array, and you can start up again with that index later.

This example is on Linux - mmap open on Windows is a bit different - but after it's set up, access should be the same.

import os
import mmap

# I think this is the record plus newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1 

# generate test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ] 
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 11", get_record(data, 11))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        stop = mmapped_file.size() // LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data,6,9)])

del data
f.close()

If the length of the line is constant (in this case 12: 11 characters plus the newline character), you might do

def get_line(k, line_len):
    # binary mode, so seek() offsets are exact byte positions
    with open('file', 'rb') as f:
        f.seek(k*line_len)
        return next(f).decode('ascii')
