Reading a text file in blocks - losing data?
I have a huge text file that can be anywhere from 2 to 20 GB. The content is a list of results from a given list of queries. I am trying to send it into my 'parse' script in blocks, as I need each set of results to be read into memory so that I can do index operations on them. For some reason, when I load the entire file into memory, I get more parsing results than if I use the following code to chop my input file into blocks:
with open(infile, 'r') as rfile:
    block = []
    thresh = 100000
    for i, line in enumerate(rfile):
        if i >= thresh:
            if "Iteration: 1" in line:  # This marks the end of one set of results, and the beginning of the next, so we don't truncate any results
                read_block(block)
                thresh += 100000
                del block[:]
        block.append(line)
Any idea why I am losing data with this code? Or is everything kosher here, and my error is the result of this function interacting weirdly with the read_block() method?
This nested `if` is probably causing you problems (possibly with help from `read_block`):
if i >= thresh:
    if "Iteration: 1" in line:
        read_block(block)
        thresh += 100000
        del block[:]
It is equivalent to a compound `if` condition, since it only runs the innermost block if both conditions are true:
if i >= thresh and "Iteration: 1" in line:
    read_block(block)
    thresh += 100000
    del block[:]
So whenever this loop encounters a short set of results --- less than 100,000 lines --- the outer `for` loop continues slurping in result sets until their combined length passes the threshold. When `read_block` is finally called, it will be given a `block` of lines that contains two or more sets. Can `read_block` cope with that?
Also, if you have a particularly long results set --- or any combination of sets that don't add up to exactly 100,000 lines --- the threshold is completely ignored until the loop reaches the end of the current set. If your other functions assume that `block` can never exceed 100,000 lines, they will get a rude surprise.
Finally, `thresh` is always incremented by 100,000, instead of by however many lines were actually read. Since the increment can't happen until `i >= thresh` (possibly much greater than), `thresh` will lag farther and farther behind reality. If that 100,000-line threshold really is important, you should set it to 100,000 lines from now:

    thresh = i + 100000
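A sketch of the loop with that fix applied, wrapped in a function so it can be exercised; the function name and parameters are hypothetical, `read_block` remains the question's callback, and the trailing block is flushed at EOF so it isn't silently dropped:

```python
def read_in_blocks(path, read_block, min_lines=100000, marker="Iteration: 1"):
    """Flush a block at the first set boundary after at least min_lines,
    measuring the next threshold from the line we actually stopped at."""
    block = []
    thresh = min_lines
    with open(path, "r") as rfile:
        for i, line in enumerate(rfile):
            if i >= thresh and marker in line:
                read_block(block)
                thresh = i + min_lines  # 100,000 lines from *now*
                del block[:]
            block.append(line)
    if block:
        read_block(block)  # don't lose the trailing set(s)
```

Note that a flushed block can still span multiple result sets; this fix addresses only the threshold lag, not the merging described above.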
To reiterate what I commented above, why not just feed `read_block` (or `read_one_results_set` or whatever) a single results set at a time, and not worry about counting lines?
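As a sketch of that approach --- assuming, per the question, that a line containing `"Iteration: 1"` begins each new set; the generator name is hypothetical:

```python
def iter_result_sets(path, marker="Iteration: 1"):
    """Yield one complete result set (a list of lines) at a time, so the
    caller never needs to count lines or guess at block boundaries."""
    block = []
    with open(path, "r") as rfile:
        for line in rfile:
            if marker in line and block:
                yield block  # a set ends where the next one begins
                block = []
            block.append(line)
    if block:
        yield block  # the final set has no marker after it
```

The main loop then becomes simply `for block in iter_result_sets(infile): read_block(block)`, and memory holds only one result set at a time, regardless of how long or short each set is.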