
Reading very large file where format is newline independent

My Python code supports reading and writing data in a file format created by others, called the BLT format. The BLT format is whitespace- and newline-independent, in that a newline is treated just like any other whitespace. The primary entry in this format is a "ballot", which ends with a "0", e.g.,

1 2 3 0

Since the format is newline independent, it could also be written as

1 2
3 0

Or you could have multiple ballots on a line:

1 2 3 0 4 5 6 0

These files can be very large so I don't want to read an entire file into memory. Line-based reading is complicated since the data is not line-based. What is a good way to process these files in a memory-efficient way?

For me, the most straightforward way to solve this is with generators.

def tokens(filename):
    with open(filename) as infile:
        for line in infile:
            for item in line.split():
                yield int(item)

def ballots(tokens):
    ballot = []
    for t in tokens:
        if t:
            ballot.append(t)
        else:
            yield ballot
            ballot = []

t = tokens("datafile.txt")

for b in ballots(t):
    print(b)

I see @katrielalex posted a generator-based solution while I was posting mine. The difference between ours is that I'm using two separate generators, one for the individual tokens in the file and one for the specific data structure you wish to parse. The former is passed to the latter as a parameter, the basic idea being that you can write a function like ballots() for each of the data structures you wish to parse. You can either iterate over everything yielded by the generator, or call next() on either generator to get the next token or ballot (be prepared for a StopIteration exception when you run out; alternatively, write the generators to yield a sentinel value such as None when they run out of real data, and check for that).
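As a self-contained sketch of the next()-based usage described above (the generators are the same as in the answer; io.StringIO stands in here for a real file on disk, so tokens() takes a file object rather than a filename):

```python
import io

def tokens(infile):
    # Yield each whitespace-separated integer from an open file object.
    for line in infile:
        for item in line.split():
            yield int(item)

def ballots(tokens):
    # Group tokens into ballots, using 0 as the terminator.
    ballot = []
    for t in tokens:
        if t:
            ballot.append(t)
        else:
            yield ballot
            ballot = []

data = io.StringIO("1 2\n3 0 4 5 6 0\n")
b = ballots(tokens(data))
print(next(b))  # [1, 2, 3]
print(next(b))  # [4, 5, 6]
try:
    next(b)
except StopIteration:
    print("no more ballots")
```

Note that ballots whose terminating "0" is missing at end-of-file are silently dropped; whether that should instead raise an error depends on how strict you want the parser to be.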

It would be pretty straightforward to wrap the whole thing in a class. In fact...

class Parser(object):

    def __init__(self, filename):

        def tokens(filename):
            with open(filename) as infile:
                for line in infile:
                    for item in line.split():
                        yield int(item)

        self.tokens = tokens(filename)

    def ballots(self):
        ballot = []
        for t in self.tokens:
            if t:
                ballot.append(t)
            else:
                yield ballot
                ballot = []

p = Parser("datafile.txt")

for b in p.ballots():
    print(b)

Use a generator:

>>> def ballots(f):
...     ballot = []
...     for line in f:
...         for token in line.split():
...             if token == '0':
...                 yield ballot
...                 ballot = []
...             else:
...                 ballot.append(token)

This will read the file line by line, split on all whitespace, and append the tokens on each line one by one to a list. Whenever a zero is reached, that ballot is yielded and the list is reset to empty.
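For illustration, this generator can be driven with an in-memory file object (io.StringIO stands in for a real file here); unlike the previous answer, it yields the tokens as strings rather than ints, since no conversion is performed:

```python
import io

def ballots(f):
    # Yield one list of string tokens per ballot; '0' terminates a ballot.
    ballot = []
    for line in f:
        for token in line.split():
            if token == '0':
                yield ballot
                ballot = []
            else:
                ballot.append(token)

f = io.StringIO("1 2 3 0 4 5\n6 0\n")
print(list(ballots(f)))  # [['1', '2', '3'], ['4', '5', '6']]
```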
