Reading a huge file in Python: Why am I getting a Segmentation Fault?

I know I shouldn't read the whole file into memory at once, but I'm not doing that.

I thought maybe I was doing something memory-heavy inside the loop, and got rid of everything until I was left with this:

with open("huge1.txt", "r") as f:
    for line in f:
        pass

It gave me a Segmentation Fault.

If I understand correctly, iterating over a file like that is lazy and shouldn't load more than one line at a time into memory.

I also tried using islice, but with the same results.
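
A typical islice attempt looks something like this (the batch size of 1000 is arbitrary, just to show the shape):

from itertools import islice

with open("huge1.txt", "r") as f:
    while True:
        # islice is lazy: it pulls at most 1000 lines per batch
        batch = list(islice(f, 1000))
        if not batch:
            break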

My file is line-based, the lines are all short, and the size of the file is around 6 GB.

What am I missing?

A segmentation fault should not occur no matter what, because the Python interpreter should catch errors and raise exceptions at the language level. So your Python interpreter definitely has a bug.

Now, as for what could trigger the bug: you read the file line by line, discarding each line once you have read the next one (actually retaining two lines at a time, because the previous line cannot be discarded until the assignment of the next line completes).

So, if it runs out of memory (a likely cause of a segmentation fault, e.g. malloc() returning NULL and the caller failing to check the return value), it is probably because some of the lines are still too big.

If you run a GNU/something system, you can run wc -L huge1.txt to check the length of the longest line.
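
If wc isn't available, a rough Python equivalent (counting bytes per line, scanning in fixed-size binary blocks so that even a pathological line can't exhaust memory) would be:

longest = current = 0
with open("huge1.txt", "rb") as f:
    while True:
        block = f.read(2**20)  # 1 MiB at a time
        if not block:
            break
        parts = block.split(b"\n")
        # Every part except the last ends at a newline inside this block
        for part in parts[:-1]:
            current += len(part)
            longest = max(longest, current)
            current = 0
        # The last part may continue into the next block
        current += len(parts[-1])
longest = max(longest, current)  # the file may lack a trailing newline
print(longest)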

If you do have a very long line, either it is a problem with the file and you can just fix it, or you will need to resort to reading the file block by block instead of line by line, using f.read(2**20).
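
A minimal sketch of that block-by-block approach (2**20 bytes, i.e. 1 MiB per read, is just an example size):

with open("huge1.txt", "r") as f:
    while True:
        block = f.read(2**20)  # read 1 MiB at a time, regardless of line lengths
        if not block:  # an empty string means end of file
            break
        # process block here; note that a line may span two blocks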

And if you feel like helping the Python developers, you could submit a bug report as well. The interpreter should never segfault.

A try/except will give you an idea of where the problem is:

with open("huge1.txt", "r") as f:
    ctr = 0
    previous = ""
    try:
        for line in f:
            ctr += 1
            previous = line
    except Exception:
        # Print the number of the last line read successfully and its contents
        print(ctr, previous)
