Reading a huge file in Python: Why am I getting a Segmentation Fault?
I know I shouldn't read the whole file into memory at once, but I'm not doing that.
I thought maybe I was doing something memory-heavy inside the loop, so I got rid of everything until I was left with this:
with open("huge1.txt", "r") as f:
    for line in f:
        pass
It gave me a Segmentation Fault.
If I understand correctly, iterating over a file like that is lazy and shouldn't load more than one line at a time into memory.
I also tried using islice, but with the same results.
My file is line-based, the lines are all short, and the size of the file is around 6 GB.
What am I missing?
A segmentation fault should not occur no matter what, because the Python interpreter should catch errors and raise exceptions at the language level. So your Python interpreter definitely has a bug.
Now, as for what could trigger the bug: you read the file line by line, discarding each line once you have read the next one (actually retaining two lines at a time, because the previous line cannot be discarded until the assignment of the next line is complete).
So, if it runs out of memory (which is a likely reason for a segmentation fault, e.g. malloc() returning NULL and the caller failing to check the return value), it is probably because some of the lines are too big after all.
If you run a GNU/something system, you can run wc -L huge1.txt to check the length of the longest line.
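If wc is not available, the same check can be done in Python itself. This is a minimal sketch (not from the answer): it scans fixed-size blocks and tracks the distance between newlines, so no full line is ever materialized in memory, even if one line is gigabytes long. The function name and block size are illustrative choices.

```python
def longest_line_length(path, block_size=2**20):
    """Return the length in bytes of the longest line, reading in blocks."""
    longest = 0
    current = 0  # length of the line being scanned so far
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            start = 0
            while True:
                nl = block.find(b"\n", start)
                if nl == -1:
                    # no more newlines in this block: the line continues
                    current += len(block) - start
                    break
                current += nl - start
                longest = max(longest, current)
                current = 0
                start = nl + 1
    # account for a final line with no trailing newline
    return max(longest, current)
```

Because it only ever holds one block at a time, its memory use stays bounded by block_size regardless of the file's line structure.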
If you do have a very long line, either it is a problem with the file and you can just fix it, or you will need to resort to reading the file block by block instead of line by line, using f.read(2**20).
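A minimal sketch of that block-by-block approach (not from the answer): the function name is illustrative, and the byte and newline counts are just placeholder work standing in for whatever per-block processing you actually need.

```python
def process_in_blocks(path, block_size=2**20):
    """Read a file in fixed-size blocks instead of line by line."""
    byte_count = 0
    newline_count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)  # at most block_size bytes at a time
            if not block:
                break
            byte_count += len(block)
            newline_count += block.count(b"\n")
    return byte_count, newline_count
```

Unlike line iteration, this keeps memory bounded at block_size even when a "line" in the file turns out to be enormous.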
And if you feel like helping the Python developers, you could submit a bug report as well. The interpreter should never segfault.
Try/except will give you an idea where the problem is:
with open("huge1.txt", "r") as f:
    ctr = 0
    previous = ""
    try:
        for line in f:
            ctr += 1
            previous = line
    except Exception as e:
        print(ctr, previous, e)