
Python: jump to a line in a txt file (a gzipped one)

I'm reading through a large file, and processing it. I want to be able to jump to the middle of the file without it taking a long time.

Right now I am doing:

import gzip

f = gzip.open(input_name)
for i in range(1000000):
    f.read() # just skipping the first 1M rows

for line in f:
    do_something(line)

Is there a faster way to skip the lines in the zipped file? If I have to unzip it first, I'll do that, but there has to be a way.

It's of course a text file, with \n separating lines.

The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.

To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.

import gzip
import itertools

with gzip.open(filename) as f:
    # jump to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        do_something(line)  # have a party

Alternatively, if this is a CSV that you're going to be working with, you could also try clocking pandas' parsing, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
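A minimal sketch of that, assuming the file is a headerless CSV and the goal is still to skip the first million rows (the file name here is a placeholder):

import pandas as pd

# pandas decompresses the gzip on the fly; skiprows drops the first 1M lines
# before parsing, though it still has to decompress and scan past them.
parsed_csv = pd.read_csv("input.csv.gz", compression="gzip",
                         skiprows=1000000, header=None)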

Also, to be extra clear, when you iterate over file objects in Python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.

You can use itertools.islice, passing a file object f and a starting point; it will still advance the iterator, but more efficiently than calling next 1000000 times:

from itertools import islice

for line in islice(f, 1000000, None):
    print(line)

I'm not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999999 calls are doing nothing. If you wanted to manually advance the iterator, you would call next on the file object, i.e. next(f).

Calling next(f) won't mean all the lines are read into memory at once either; it advances the iterator one line at a time, so it can be useful if you want to skip a line or two, or a header.
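As a small sketch (the file name is a placeholder, and do_something is the function from the question):

import gzip

with gzip.open("input.txt.gz", "rt") as f:
    next(f)        # skip the header line
    next(f, None)  # skip one more; the None default avoids StopIteration at EOF
    for line in f:
        do_something(line)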

The consume recipe that @wwii suggested is also worth checking out.
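For reference, the consume recipe from the itertools documentation is roughly:

from collections import deque
from itertools import islice

def consume(iterator, n=None):
    """Advance the iterator n steps ahead; if n is None, consume it entirely."""
    if n is None:
        # feed the whole iterator into a zero-length deque
        deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)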

Not really.

If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte to count the newline characters.

The only alternative that comes to my mind is if you are handling a certain static file that won't change. In that case, you can index it once, i.e. find out and remember the position of each line. If you keep that in, e.g., a dictionary that you save and load with pickle, you can skip to any line in quasi-constant time with seek.
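A minimal sketch of that idea, assuming the file has already been decompressed to plain text (offsets into the compressed .gz stream would not help much, since GzipFile still has to decompress everything up to a seek target); the file names and do_something are placeholders:

import pickle

def build_line_index(path):
    """Record the byte offset at which each line starts (binary mode)."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            offsets.append(f.tell())
            if not f.readline():
                break
    return offsets[:-1]  # drop the trailing EOF offset

# build once and save with pickle
offsets = build_line_index("input.txt")
with open("line_offsets.pkl", "wb") as fh:
    pickle.dump(offsets, fh)

# later: jump straight to line 1,000,000 in quasi-constant time
with open("line_offsets.pkl", "rb") as fh:
    offsets = pickle.load(fh)
with open("input.txt", "rb") as f:
    f.seek(offsets[1000000])
    for line in f:          # lines come back as bytes in binary mode
        do_something(line)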

It is not possible to randomly seek within a gzip file. Gzip is a stream algorithm, so the file must always be decompressed from the start up to the point where your data of interest lies.

It is not possible to jump to a specific line without an index. Lines can be scanned forward, or scanned backwards from the end of the file, in successive chunks.

You should consider a different storage format for your needs. What are your needs?
