python jump to a line in a txt file (a gzipped one)

I'm reading through a large file and processing it. I want to be able to jump to the middle of the file without it taking a long time.

Right now I am doing:

import gzip

f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows

for line in f:
    do_something(line)

Is there a faster way to skip the lines in the zipped file? If I have to unzip it first, I'll do that, but there has to be a way.

It's of course a text file, with \n separating lines.

The nature of gzipping is such that there is no longer a concept of lines once the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.

To read the file, you'll need to decompress it -- the gzip module does a fine job of that. Like the other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.

import gzip
import itertools

with gzip.open(filename) as f:
    # jump to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        do_something(line)  # have a party

Alternatively, if this is a CSV that you're going to be working with, you could also try clocking pandas parsing, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
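A hedged sketch of that idea, also folding in the skip: skiprows is a standard read_csv parameter, and the file name here is a placeholder.

import pandas as pd

# Decompress and parse in one step; skiprows drops the first 1M lines
# (counted from the top of the file, so a header row counts too).
parsed_csv = pd.read_csv('data.csv.gz', compression='gzip', skiprows=1000000)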

Also, to be extra clear: when you iterate over file objects in Python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
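For example, a tiny sketch of that behavior (the file name is illustrative):

with open('data.txt') as f:
    for line in f:                # each iteration yields exactly one line
        print(line.rstrip('\n'))  # the trailing newline is part of the line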

You can use itertools.islice, passing the file object f and a starting point. It will still advance the iterator, but more efficiently than calling next 1000000 times:

from itertools import islice

for line in islice(f, 1000000, None):
    print(line)

I'm not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999999 calls do nothing. If you wanted to manually advance the iterator, you would call next on the file object, i.e. next(f).

Calling next(f) won't mean all the lines are read into memory at once either; it advances the iterator one line at a time, so it can be useful if you want to skip a line or two in a file, or a header.
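A quick sketch of that pattern for skipping a header (the file name is a placeholder; do_something is the asker's function):

with open('data.csv') as f:
    header = next(f)  # consume only the header line; nothing else is read
    for line in f:    # iteration resumes from the second line
        do_something(line)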

The consume recipe that @wwii suggested is also worth checking out.
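For reference, this is the consume recipe from the itertools documentation:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)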

Not really.

If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte to count the newline characters.

The only alternative that comes to mind is if you are handling a certain static file that won't change. In that case, you can index it once, i.e. find out and remember the position of each line. If you have that in, e.g., a dictionary that you save and load with pickle, you can skip to any line in quasi-constant time with seek.
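A minimal sketch of that indexing idea, assuming an uncompressed text file; the names build_line_index, data.txt, and index.pkl are illustrative:

import pickle

def build_line_index(path):
    # One pass over the file, recording the byte offset where each line starts.
    offsets = []
    with open(path, 'rb') as f:
        offset = f.tell()
        while f.readline():
            offsets.append(offset)
            offset = f.tell()
    return offsets

offsets = build_line_index('data.txt')
with open('index.pkl', 'wb') as idx:
    pickle.dump(offsets, idx)  # persist the index for later runs

# Later: jump straight to line 1,000,000 in quasi-constant time.
with open('index.pkl', 'rb') as idx:
    offsets = pickle.load(idx)
with open('data.txt', 'rb') as f:
    f.seek(offsets[1000000])
    line = f.readline()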

It is not possible to seek randomly within a gzip file. Gzip is a stream algorithm, so it must always be decompressed from the start up to wherever your data of interest lies.
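Since the question allows unzipping first, one workaround is to pay the decompression cost once and then seek in the plain file. This sketch uses shutil.copyfileobj to stream the copy; the file names are placeholders:

import gzip
import shutil

# One-time cost: stream-decompress to disk without holding it all in memory.
with gzip.open('data.txt.gz', 'rb') as src, open('data.txt', 'wb') as dst:
    shutil.copyfileobj(src, dst)
# Afterwards, .seek() on data.txt is cheap, given byte offsets (e.g. an index).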

It is not possible to jump to a specific line without an index. Lines can be scanned forward, or scanned backwards from the end of the file, in successive chunks.
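To illustrate the backwards case, a sketch that yields lines from the end of an uncompressed file by reading fixed-size chunks in reverse; the function name and chunk size are assumptions:

def lines_backwards(path, chunk_size=8192):
    """Yield lines last-to-first by reading the file in reverse chunks."""
    with open(path, 'rb') as f:
        f.seek(0, 2)               # jump to the end of the file
        pos = f.tell()
        buffer = b''
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buffer = f.read(step) + buffer
            lines = buffer.split(b'\n')
            buffer = lines.pop(0)  # may be a partial line continuing leftwards
            for line in reversed(lines):
                yield line         # note: a trailing newline yields b'' first
        yield buffer               # finally, the very first line of the file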

You should consider a different storage format for your needs. What are your needs?
