Reading a huge file in Python: Why am I getting a Segmentation Fault?

I know I shouldn't read the whole file into memory at once, but I'm not doing that.

I thought maybe I was doing something memory-heavy inside the loop, and got rid of everything until I was left with this:

with open("huge1.txt", "r") as f:
    for line in f:
        pass

It gave me a Segmentation Fault.

If I got everything right, iterating over a file like that is lazy and shouldn't load more than one line at a time into memory.

I also tried using islice, with the same result.

My file is line based, the lines are all short and the size of the file is around 6 GB.

What am I missing?

A segmentation fault should never occur in pure Python code, because the Python interpreter is supposed to catch errors and raise them as exceptions in the language. So this points to a bug in your Python interpreter.

Now, as for what could trigger the bug. You read the file line by line, discarding each line once you have read the next line (actually retaining 2 lines at a time, because the previous line cannot be discarded until the assignment of the next line is complete).

So, if it runs out of memory (a likely cause of a segmentation fault, for example malloc() returning NULL and the caller failing to check the return value), it is probably because some of the lines are still too big.

If you are on a GNU system (such as GNU/Linux), you can run wc -L huge1.txt to check the length of the longest line.
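If wc -L is not available, the same check can be approximated in Python without ever holding a whole line in memory. This is only a sketch: the function name longest_line_length and its block_size parameter are made up for illustration, and it counts bytes, so its result may differ from what wc -L reports for tabs or non-ASCII text.

```python
def longest_line_length(path, block_size=2**20):
    """Return the length in bytes of the longest line, excluding the newline."""
    longest = 0
    current = 0  # bytes seen so far in the current (possibly unfinished) line
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # end of file
                break
            start = 0
            while True:
                nl = block.find(b"\n", start)
                if nl == -1:
                    # No more newlines in this block: the line continues
                    current += len(block) - start
                    break
                current += nl - start
                longest = max(longest, current)
                current = 0
                start = nl + 1
    # Account for a final line that has no trailing newline
    return max(longest, current)
```

Calling longest_line_length("huge1.txt") and getting a number in the hundreds of millions or more would support the theory that one oversized line is the trigger.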

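If you do have a very long line, either it is a problem with the file and you can just fix it, or you will need to resort to reading the file block by block instead of line by line, using f.read(2**20), which returns at most 2**20 characters per call (1 MiB per call if the file is opened in binary mode).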
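Reading block by block could look roughly like the following. It is a minimal sketch, assuming binary mode and a 1 MiB block size; read_in_blocks and count_newlines are illustrative names, not anything from the question.

```python
def read_in_blocks(path, block_size=2**20):
    """Yield successive blocks of at most block_size bytes from the file."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # an empty bytes object means end of file
                break
            yield block


def count_newlines(path):
    """Example consumer: count newline bytes without building any lines."""
    return sum(block.count(b"\n") for block in read_in_blocks(path))
```

Memory use stays bounded by block_size no matter how long an individual line is, at the cost of having to handle lines that straddle two blocks yourself.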

And if you feel like helping the Python developers, you could submit a bug report as well. The interpreter should never segfault.

A try/except will give you an idea of where the problem is, as long as the failure surfaces as a Python exception:

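with open("huge1.txt", "r") as f:
    ctr = 0        # number of lines read successfully so far
    previous = ""  # last line read before the failure
    try:
        for line in f:
            ctr += 1
            previous = line
    except Exception as exc:
        # Report how far we got, the last good line, and the error itself
        print(ctr, previous, exc)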
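A crash at the C level ends the process before any except clause can run, so for a real segfault it may help to combine a progress counter with the standard-library faulthandler module (Python 3.3 and later), which dumps the Python traceback to stderr when the process receives SIGSEGV. The sketch below assumes such a Python version; count_lines_with_progress and report_every are hypothetical names.

```python
import faulthandler
import sys


def count_lines_with_progress(path, report_every=10**6):
    """Count lines, printing progress so a crash can be roughly located."""
    try:
        # On a fatal signal such as SIGSEGV, dump the traceback to stderr
        faulthandler.enable()
    except (RuntimeError, ValueError, AttributeError):
        pass  # stderr is not a real file (e.g. it is captured); go on without it
    count = 0
    with open(path, "r") as f:
        for line in f:
            count += 1
            if count % report_every == 0:
                # Flush so the last progress value survives a hard crash
                print(count, file=sys.stderr, flush=True)
    return count
```

The last number printed before the crash narrows down which part of the file to inspect.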


 