
Reading a text file in blocks - losing data?

I have a huge text file that can be anywhere from 2 to 20 GB. The content is a list of results from a given list of queries. I am trying to send it into my 'parse' script in blocks, as I need each set of results to be read into memory so that I can do index operations on them. For some reason, when I load the entire file into memory, I get more parsing results than if I use the following code to chop my input file into blocks:

with open(infile, 'r') as rfile:
    block = []
    thresh = 100000
    for i, line in enumerate(rfile):

        if i >= thresh:
            if "Iteration: 1" in line: # This marks the end of one set of results, and the beginning of the next, so we don't truncate any results
                read_block(block)
                thresh += 100000
                del block[:]

        block.append(line)

Any idea why I am losing data with this code? Or is everything kosher here, and my error is the result of this function interacting weirdly with the read_block() method...

This nested if is probably causing you problems (possibly with help from read_block):

if i >= thresh:
    if "Iteration: 1" in line:
        read_block(block)
        thresh += 100000
        del block[:]

It is equivalent to a compound if condition, since it only runs the innermost block if both conditions are true:

if i >= thresh and "Iteration: 1" in line:
    read_block(block)
    thresh += 100000
    del block[:]

So whenever this loop encounters a short set of results (less than 100,000 lines), the outer for loop continues slurping in results sets until their combined length passes the threshold. When read_block is finally called, it will be given a block of lines that contains two or more sets. Can read_block cope with that?

Also, if you have a particularly long results set (or any combination of sets that don't add up to exactly 100,000 lines), the threshold is completely ignored until the loop reaches the end of the current set. If your other functions assume that block can never exceed 100,000 lines, they will get a rude surprise.

Finally, thresh is always incremented by 100,000, instead of by however many lines were actually read. Since the increment can't happen until i >= thresh (and i may be far past thresh by then), thresh will lag farther and farther behind reality. If that 100,000-line threshold really is important, you should set it to 100,000 lines from now:

thresh = i + 100000
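
For concreteness, here is the original loop with only that one change applied (a sketch; the block-merging issues described above still stand):

with open(infile, 'r') as rfile:
    block = []
    thresh = 100000
    for i, line in enumerate(rfile):
        if i >= thresh and "Iteration: 1" in line:
            read_block(block)
            thresh = i + 100000  # 100,000 lines from *now*, not from the stale threshold
            del block[:]
        block.append(line)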

To reiterate what I commented above, why not just feed read_block (or read_one_results_set, or whatever) a single results set at a time, and not worry about counting lines?
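
Here is a minimal sketch of that approach, assuming (per the comment in the question's code) that a line containing "Iteration: 1" marks the boundary between results sets; the generator name iter_result_sets is made up for illustration:

def iter_result_sets(rfile, marker="Iteration: 1"):
    # Yield one results set (a list of lines) at a time.
    block = []
    for line in rfile:
        if marker in line and block:
            yield block  # hand back the completed set
            block = []
        block.append(line)  # the marker line starts the next set
    if block:  # don't silently drop the final set
        yield block

with open(infile, 'r') as rfile:
    for block in iter_result_sets(rfile):
        read_block(block)

Each call to read_block then sees exactly one results set, regardless of how many lines it spans, so no line counting or threshold bookkeeping is needed.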
