
Python: itertools.islice not working in a loop

I have code like this:

#opened file f
goto_line = num_lines #Total number of lines
while not found:
   line_str = next(itertools.islice(f, goto_line - 1, goto_line))
   goto_line = goto_line/2
   #checks for data, sets found to True if needed

line_str is correct on the first pass, but every pass after that reads a different line than it should.

So, for example, goto_line starts off as 1000. It reads line 1000 just fine. Then on the next loop, goto_line is 500, but it doesn't read line 500. It reads some line closer to 1000.

I'm trying to read specific lines in a large file without reading more than necessary. Sometimes it jumps backwards to a line and sometimes forwards.

I did try linecache, but I typically don't run this code more than once on the same file.

Python iterators can be consumed only once. This is easiest to see by example. The following code

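from itertools import islice
a = range(10)
i = iter(a)
# Every islice() call below shares the same iterator i,
# so each call continues where the previous one stopped.
print(list(islice(i, 1, 3)))
print(list(islice(i, 1, 3)))
print(list(islice(i, 1, 3)))
print(list(islice(i, 1, 3)))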

prints

[1, 2]
[4, 5]
[7, 8]
[]

The slicing always starts where we stopped last time.

The easiest way to make your code work is to use f.readlines() to get a list of the lines in the file and then use normal Python list slicing [i:j]. If you really want to use islice(), you could start reading the file from the beginning each time by calling f.seek(0) first, but this will be very inefficient.
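A minimal sketch of both options. It assumes an in-memory io.StringIO as a stand-in for the opened file f; the sample data and the helper name line_via_islice are made up for illustration.

```python
import io
from itertools import islice

# Stand-in for the opened file: 1000 lines, "line 1" .. "line 1000".
f = io.StringIO("".join("line %d\n" % n for n in range(1, 1001)))

# Option 1: read everything once, then index the list (0-based).
lines = f.readlines()
line_1000 = lines[1000 - 1]
line_500 = lines[500 - 1]

# Option 2: keep islice(), but rewind before every lookup.
def line_via_islice(fileobj, line_no):
    """Return line `line_no` (1-based) by rescanning from the start."""
    fileobj.seek(0)
    return next(islice(fileobj, line_no - 1, line_no))

again_1000 = line_via_islice(f, 1000)
again_500 = line_via_islice(f, 500)
```

Option 1 holds the whole file in memory. Option 2 does not, but it rereads up to line_no lines on every call.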

You cannot go back in the file this way (there may be some way to do it, depending on how the file is opened). The standard file iterator moves only forward; in fact, Python's iterator protocol only supports forward iteration, so this is true of most iterators. So after reading k lines, reading another k/2 lines actually gives you line k + k/2.
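To make the k + k/2 arithmetic concrete, here is a small sketch. It assumes a 2000-line io.StringIO in place of the real file, so that the second read has enough lines left to land on.

```python
import io
from itertools import islice

# Assumed sample data: 2000 lines, "line 1" .. "line 2000".
f = io.StringIO("".join("line %d\n" % n for n in range(1, 2001)))

# First lookup: consumes lines 1..1000 and returns line 1000.
first = next(islice(f, 1000 - 1, 1000))

# Second lookup on the SAME file object: skips 499 more lines
# (1001..1499) and returns the next one, not line 500.
second = next(islice(f, 500 - 1, 500))
```

Here first is line 1000 and second is line 1500, which is 1000 + 500.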

You could try reading the whole file into memory, but you have a lot of data, so memory consumption will probably become an issue. You could use file.seek to move around in the file, but that's still a lot of work. Perhaps you could use a memory-mapped file? That's only possible if the lines are fixed-size, though. If necessary, you could precompute the line numbers you'd like to check and save all of those lines in one iteration, so you don't have to scroll back after reading the whole file. That shouldn't be too many lines: roughly int(log_2(line_count)) + 1, if I'm not mistaken.
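The precompute-and-save idea could look roughly like this. It is a sketch under some assumptions: the search visits lines by pure halving (n, n//2, ..., 1), line_count is at least 1, and the helper names halving_targets and collect_lines are hypothetical.

```python
import io

def halving_targets(line_count):
    """Line numbers visited by repeated halving: n, n//2, ..., 1."""
    targets = []
    n = line_count
    while n >= 1:
        targets.append(n)
        n //= 2
    return targets

def collect_lines(fileobj, wanted):
    """Single forward pass that keeps only the wanted 1-based line numbers."""
    wanted = set(wanted)
    last = max(wanted)
    found = {}
    for line_no, line in enumerate(fileobj, start=1):
        if line_no in wanted:
            found[line_no] = line
        if line_no >= last:
            break  # nothing wanted beyond this point
    return found

# Assumed sample data standing in for the real file.
f = io.StringIO("".join("line %d\n" % n for n in range(1, 1001)))
targets = halving_targets(1000)
saved = collect_lines(f, targets)
```

For 1000 lines this keeps 10 lines in memory, and the search can then run against the saved dictionary instead of the file. If the real search sometimes jumps forward as well, the set of candidate lines would have to be worked out differently.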
