
Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20 GB file and writing the lines that meet a certain condition to another file. Occasionally, however, Python reads in two lines at once and concatenates them.

inputFileHandle = open(inputFileName, 'r')

row = 0

for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

I've checked the line endings in the source file and they check out as line feeds (ASCII character 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some Python limitation here? The first anomaly occurs at around the 4 GB mark in the file.

A quick Google search for "python reading files larger than 4gb" yields many, many results. See here for one such example and another one which takes over from the first.

It's a bug in Python.

Now, the explanation of the bug. It's not easy to reproduce because it depends both on the internal FILE buffer size and on the number of chars passed to fread(). In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment: "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in the Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668

The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it fails because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?] At this point, the function thinks that the next read() will return the LF, but it won't, because the file pointer was not moved back.
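To make the failure mode concrete, here is a toy model in Python. It is not the CRT or Perl code: read_translated, chunk_size and seek_back_works are hypothetical names made up for this illustration, and the failing SetFilePointer() call is simulated by simply skipping the step back. It only shows how a CR at the end of a buffer, followed by a lookahead that is never undone, swallows the LF and glues two lines together.

```python
import io

def read_translated(raw, chunk_size, seek_back_works=True):
    """Toy CRLF -> LF translation over fixed-size chunks (illustration only)."""
    out = bytearray()
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        if chunk.endswith(b"\r"):
            # "The hard part": a CR at the end of the buffer, so peek one byte.
            body = chunk[:-1]
            peek = raw.read(1)
            if peek and seek_back_works:
                # Step back so the next read sees the peeked byte again.
                raw.seek(-1, io.SEEK_CUR)
            if peek != b"\n":
                # Lone CR (or end of file): keep it.
                body += b"\r"
            # Otherwise the CR is dropped, trusting the next read to deliver the LF.
            chunk = body
        out += chunk.replace(b"\r\n", b"\n")
    return bytes(out)

data = b"abcdefgh\r\nxy\r\nz\r\n"   # the first CR lands exactly at the end of a 9-byte chunk

good = read_translated(io.BytesIO(data), 9)
bad = read_translated(io.BytesIO(data), 9, seek_back_works=False)

print(good)  # b'abcdefgh\nxy\nz\n'
print(bad)   # b'abcdefghxy\nz\n'  (first two lines merged)
```

Whether a given file triggers this depends on where the CRs happen to fall relative to the buffer boundaries, which fits the remark above that the bug is hard to reproduce.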

And the work-around:

But note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().
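A minimal sketch of the filtering loop rewritten around io.open(), so that newline translation is done by Python's io layer instead of the C runtime. The function name filter_lines, the keep callback and the UTF-8 encoding are assumptions for illustration; substitute whatever encoding the real file uses.

```python
import io

def filter_lines(input_path, output_path, keep):
    """Copy lines for which keep(line) is true; return 1-based numbers of skipped rows."""
    ignored_rows = []
    # newline=None: the io layer translates "\r\n" and "\r" to "\n" on read.
    # newline="\n": lines are written back out without further translation.
    with io.open(input_path, "r", encoding="utf-8", newline=None) as src, \
         io.open(output_path, "w", encoding="utf-8", newline="\n") as dst:
        for row, line in enumerate(src, 1):
            if keep(line):
                dst.write(line)
            else:
                ignored_rows.append(row)
    return ignored_rows
```

The file is still streamed line by line, so memory use stays flat apart from the list of ignored row numbers. On Python 3, io.open is the same function as the built-in open.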

The 4 GB mark is suspiciously close to the largest value that can be stored in a 32-bit register (2**32 - 1).
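The arithmetic is quick to check. In the sketch below, the 5 GiB offset is just a made-up example position past the boundary, not a value taken from the question.

```python
DWORD_MAX = 2**32 - 1   # largest unsigned 32-bit value
GIB = 2**30

# 2**32 bytes is exactly 4 GiB, so offsets up to 4 GiB - 1 fit in 32 bits.
print(DWORD_MAX + 1 == 4 * GIB)   # True

offset = 5 * GIB                  # hypothetical file position past the 4 GiB mark
print(offset > DWORD_MAX)         # True: does not fit in 32 bits

low_32_bits = offset & 0xFFFFFFFF
print(low_32_bits == 1 * GIB)     # True: truncated to 32 bits it reads as 1 GiB
```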

The code you've posted looks fine by itself, so I would suspect a bug in your Python build.

FWIW, the snippet would be a little cleaner if it used enumerate:

inputFileHandle = open(inputFileName, 'r')

# start at 1 so row numbers match the original loop's 1-based counter
for row, line in enumerate(inputFileHandle, 1):
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
