
Reading n lines from file (but not all) in Python

How do you read n lines from a file instead of just one when iterating over it? I have a file with a well-defined structure and I would like to do something like this:

for line1, line2, line3 in file:
    do_something(line1)
    do_something_different(line2)
    do_something_else(line3)

but it doesn't work:

ValueError: too many values to unpack

For now I am doing this:

for line in file:
    do_something(line)
    newline = file.readline()
    do_something_else(newline)
    newline = file.readline()
    do_something_different(newline)
# ... and so on

which sucks because I am writing endless ' newline = file.readline() ' lines that clutter the code. Is there any smart way to do this? (I really want to avoid reading the whole file at once because it is huge.)

Basically, your file is an iterator which yields your file one line at a time. This turns your problem into how to yield several items at a time from an iterator. A solution to that is given in this question. Note that the function islice is in the itertools module, so you will have to import it from there.
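A minimal sketch of that approach: use itertools.islice to pull a fixed number of lines from the file iterator on each pass. The helper name and the commented usage below are illustrative, not from the original answer:

```python
from itertools import islice

def take_n_lines(f, n):
    """Yield lists of up to n lines from the file object f."""
    while True:
        chunk = list(islice(f, n))
        if not chunk:  # iterator exhausted
            return
        yield chunk

# Usage sketch, iterating a file three lines at a time:
# with open("data.txt") as f:
#     for lines in take_n_lines(f, 3):
#         process(lines)
```

The last group may be shorter than n, so unpack it only after checking its length.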

If it is XML, why not just use lxml?

You could use a helper function like this:

def readnlines(f, n):
    lines = []
    for _ in range(n):
        lines.append(f.readline())
    return lines

Then you can do something like what you want:

while True:
    line1, line2, line3 = readnlines(file, 3)
    if not line1:  # readline() returns '' at end of file
        break
    do_stuff(line1)
    do_stuff(line2)
    do_stuff(line3)

That being said, if you are using XML files, you will probably be happier in the long run if you use a real XML parser...

itertools to the rescue:

import itertools

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    # On Python 2 this was itertools.izip_longest
    return itertools.zip_longest(*args, fillvalue=fillvalue)


fobj = open(yourfile, "r")
for line1, line2, line3 in grouper(3, fobj):
    pass

for i in file produces a str, so you can't just do for i, j, k in file and read it in batches of three (try a, b, c = 'bar' and a, b, c = 'too many characters' and look at the values of a, b and c to work out why you get the "too many values to unpack").
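That experiment, written out: a string is an iterable of characters, so unpacking into three names works only when the string has exactly three characters:

```python
# A str is an iterable of characters, so unpacking succeeds
# only when the lengths match exactly.
a, b, c = 'bar'
print(a, b, c)  # b a r

try:
    a, b, c = 'too many characters'
except ValueError as e:
    print(e)  # too many values to unpack (expected 3)
```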

It's not entirely clear what you mean, but if you're doing the same thing for each line and just want to stop at some point, then do it like this:

for line in file_handle:
    do_something(line)
    if some_condition:
        break  # Don't want to read anything else

(Also, don't use file as a variable name, you're shadowing a builtin.)

If you're doing the same thing, why do you need to process multiple lines per iteration?

for line in file is your friend. It is in general much more efficient than manually reading the file, both in terms of I/O performance and memory.

Do you know something about the length of the lines / the format of the data? If so, you could read in the first n bytes (say 80*3) and do f.read(240).split("\n")[0:3].
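A rough sketch of that idea, assuming lines of at most 80 characters (the sample data below is made up, and real lines can straddle the block boundary, so treat this as a hint rather than a robust technique):

```python
import io

# Read a block large enough to hold three 80-character lines,
# then split it on newlines and keep the first three pieces.
f = io.StringIO("first line\nsecond line\nthird line\nfourth line\n")
first_three = f.read(240).split("\n")[0:3]
print(first_three)
```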

If you want to be able to use this data over and over again, one approach might be to do this:

lines = []
for line in file_handle:
    lines.append(line)

This will give you a list of the lines, which you can then access by index. Also, when you say a HUGE file, the size is most likely trivial, because Python can process thousands of lines very quickly.

Why can't you just do:

ctr = 0
for line in file:
    if ctr == 0:
        ....
    elif ctr == 1:
        ....
    ctr = ctr + 1

If you find the if/elif construct ugly, you could just create a hash table or a list of function pointers and then do:

for line in file:
    function_list[ctr]()

or something similar.
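A fuller sketch of the dispatch-list idea, with a counter cycling through the handlers. The handler names and sample data here are placeholders standing in for the question's do_something* functions:

```python
import io

# Hypothetical per-line handlers; real ones would do the actual work.
def handle_first(line):
    return ("first", line.rstrip())

def handle_second(line):
    return ("second", line.rstrip())

handlers = [handle_first, handle_second]

f = io.StringIO("alpha\nbeta\ngamma\ndelta\n")
# Cycle through the handler list as lines come in.
results = [handlers[i % len(handlers)](line) for i, line in enumerate(f)]
```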

It sounds like you are trying to read from disk in parallel... that is really hard to do. All the solutions given to you are realistic and legitimate. You shouldn't let something put you off just because the code "looks ugly". The most important thing is how efficient/effective it is; if the code is messy, you can tidy it up, but don't look for a whole new method of doing something just because you don't like how one way of doing it looks in code.

As for running out of memory, you may want to check out pickle.

It's possible to do it with a clever use of the zip function. It's short, but a bit voodoo-ish for my tastes (hard to see how it works). It cuts off any lines at the end that don't fill a group, which may be good or bad depending on what you're doing. If you need the final lines, itertools.izip_longest might do the trick.

zip(*[iter(inputfile)] * 3)

Doing it more explicitly and flexibly, this is a modification of Mats Ekberg's solution:

def groupsoflines(f, n):
    while True:
        group = []
        for i in range(n):
            try:
                group.append(next(f))
            except StopIteration:
                if group:
                    tofill = n - len(group)
                    yield group + [None] * tofill
                return
        yield group

for line1, line2, line3 in groupsoflines(inputfile, 3):
    ...

NB: If this runs out of lines halfway through a group, it will fill in the gaps with None, so that you can still unpack it. So, if the number of lines in your file might not be a multiple of three, you'll need to check whether line2 and line3 are None.

Statement: the technical posts on this site follow the CC BY-SA 4.0 license; if you republish, please credit this site or the original source. For questions contact: yoyou2525@163.com.

 