
Reading a text file in blocks - losing data?

I have a huge text file that can be anywhere from 2 to 20 GB. The content is a list of results from a given list of queries. I am trying to send it into my 'parse' script in blocks, as I need each set of results to be read into memory so that I can do index operations on them. For some reason, when I load the entire file into memory, I get more parsing results than if I use the following code to chop my input file into blocks:

with open(infile, 'r') as rfile:
    block = []
    thresh = 100000
    for i, line in enumerate(rfile):

        if i >= thresh:
            if "Iteration: 1" in line: # This marks the end of one set of results, and the beginning of the next, so we don't truncate any results
                read_block(block)
                thresh += 100000
                del block[:]

        block.append(line)

Any idea why I am losing data with this code? Or is everything kosher here, and my error is the result of this function interacting weirdly with the read_block() method...

This nested if is probably causing you problems (possibly with help from read_block):

if i >= thresh:
    if "Iteration: 1" in line:
        read_block(block)
        thresh += 100000
        del block[:]

It is equivalent to a compound if condition, since it only runs the innermost block if both conditions are true:

if i >= thresh and "Iteration: 1" in line:
    read_block(block)
    thresh += 100000
    del block[:]

So whenever this loop encounters a short set of results (less than 100,000 lines), the outer for loop continues slurping in results sets until their combined length passes the threshold. When read_block is finally called, it will be given a block of lines that contains two or more sets. Can read_block cope with that?

Also, if you have a particularly long results set (or any combination of sets that don't add up to exactly 100,000 lines), the threshold is completely ignored until the loop reaches the end of the current set. If your other functions assume that block can never exceed 100,000 lines, they will get a rude surprise.

Finally, thresh is always incremented by 100,000, instead of by however many lines were actually read. Since the increment can't happen until i >= thresh (and i may be far past thresh by then), thresh will lag farther and farther behind reality. If that 100,000-line threshold really is important, you should set it to 100,000 lines from now:

thresh = i + 100000
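
For concreteness, here is the original loop with only that one change applied (a sketch; the block-merging issues described above still stand):

with open(infile, 'r') as rfile:
    block = []
    thresh = 100000
    for i, line in enumerate(rfile):
        if i >= thresh and "Iteration: 1" in line:
            read_block(block)
            thresh = i + 100000  # 100,000 lines from *now*, not from the stale threshold
            del block[:]
        block.append(line)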

To reiterate what I commented above, why not just feed read_block (or read_one_results_set, or whatever) a single results set at a time, and not worry about counting lines?
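
Here is a minimal sketch of that approach, assuming (per the comment in the question's code) that a line containing "Iteration: 1" marks the boundary between results sets; the generator name iter_result_sets is made up for illustration:

def iter_result_sets(rfile, marker="Iteration: 1"):
    # Yield one results set (a list of lines) at a time.
    block = []
    for line in rfile:
        if marker in line and block:
            yield block  # hand back the completed set
            block = []
        block.append(line)  # the marker line starts the next set
    if block:  # don't silently drop the final set
        yield block

with open(infile, 'r') as rfile:
    for block in iter_result_sets(rfile):
        read_block(block)

Each call to read_block then sees exactly one results set, regardless of how many lines it spans, so no line counting or threshold bookkeeping is needed.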
