How should I read a file line-by-line in Python?

In prehistoric times (Python 1.4) we did:

fp = open('filename.txt')
while 1:
    line = fp.readline()
    if not line:
        break
    print(line)

after Python 2.1, we did:

for line in open('filename.txt').xreadlines():
    print(line)

before we got the convenient iterator protocol in Python 2.3, and could do:

for line in open('filename.txt'):
    print(line)

I've seen some examples using the more verbose:

with open('filename.txt') as fp:
    for line in fp:
        print(line)

Is this the preferred method going forwards?

[edit] I get that the with statement ensures closing of the file... but why isn't that included in the iterator protocol for file objects?

There is exactly one reason why the following is preferred:

with open('filename.txt') as fp:
    for line in fp:
        print(line)

We are all spoiled by CPython's relatively deterministic reference-counting scheme for garbage collection. Other, hypothetical implementations of Python will not necessarily close the file "quickly enough" without the with block if they use some other scheme to reclaim memory.

In such an implementation, you might get a "too many files open" error from the OS if your code opens files faster than the garbage collector calls finalizers on orphaned file handles. The usual workaround is to trigger the GC immediately, but this is a nasty hack and it has to be done by every function that could encounter the error, including those in libraries. What a nightmare.
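
For illustration, that workaround might look like the following sketch (open_with_gc_retry is a hypothetical helper, not a real API):

import gc

def open_with_gc_retry(path):
    # Sketch of the "nasty hack": if the OS refuses another file
    # descriptor, force a garbage-collection pass so orphaned file
    # objects are finalized and their descriptors released, then retry.
    try:
        return open(path)
    except OSError:
        gc.collect()
        return open(path)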

Or you could just use the with block.
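
For reference, the with block above is roughly equivalent to this try/finally pattern, which is why the file is closed deterministically no matter how the interpreter reclaims memory:

fp = open('filename.txt')
try:
    for line in fp:
        print(line)
finally:
    fp.close()  # runs even if the loop body raises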

Bonus Question

(Stop reading now if you are only interested in the objective aspects of the question.)

Why isn't that included in the iterator protocol for file objects?

This is a subjective question about API design, so I have a subjective answer in two parts.

On a gut level, this feels wrong, because it makes the iterator protocol do two separate things (iterate over lines and close the file handle), and it's often a bad idea to make a simple-looking function do two actions. In this case, it feels especially bad because iterators relate in a quasi-functional, value-based way to the contents of a file, but managing file handles is a completely separate task. Squashing both, invisibly, into one action is surprising to humans who read the code and makes it more difficult to reason about program behavior.

Other languages have essentially come to the same conclusion. Haskell briefly flirted with so-called "lazy IO", which allows you to iterate over a file and have it automatically closed when you get to the end of the stream, but it's almost universally discouraged to use lazy IO in Haskell these days, and Haskell users have mostly moved to more explicit resource management like Conduit, which behaves more like the with block in Python.

On a technical level, there are some things you may want to do with a file handle in Python which would not work as well if iteration closed the file handle. For example, suppose I need to iterate over the file twice:

with open('filename.txt') as fp:
    for line in fp:
        ...
    fp.seek(0)
    for line in fp:
        ...

While this is a less common use case, consider the fact that I might have just added the three lines of code at the bottom to an existing code base which originally had the top three lines. If iteration closed the file, I wouldn't be able to do that. So keeping iteration and resource management separate makes it easier to compose chunks of code into a larger, working Python program.

Composability is one of the most important usability features of a language or API.

Yes,

with open('filename.txt') as fp:
    for line in fp:
        print(line)

is the way to go.

It is not more verbose. It is safer.

If you're turned off by the extra line, you can use a wrapper function like so:

def with_iter(iterable):
    # Enter the context manager, yield each item, and exit (closing the
    # file) when iteration ends or the generator is finalized.
    with iterable as it:
        for item in it:
            yield item

for line in with_iter(open('...')):
    ...
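
One caveat: the file is only closed when the generator is exhausted or finalized, so if you break out of the loop early it is safest to close the generator explicitly. A sketch, using a hypothetical early-exit condition:

gen = with_iter(open('filename.txt'))
for line in gen:
    if line.startswith('#'):  # hypothetical early-exit condition
        break
gen.close()  # raises GeneratorExit inside with_iter, which closes the file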

In Python 3.3, the yield from statement would make this even shorter:

def with_iter(iterable):
    # Same as above: enter the context, delegate iteration, close on exit.
    with iterable as it:
        yield from it

In Python 2 you could also write:

f = open('test.txt', 'r')
for line in f.xreadlines():
    print line
f.close()
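
Note that xreadlines was removed in Python 3; there, the file object itself is already a lazy iterator over lines, so the equivalent is:

with open('test.txt', 'r') as f:
    for line in f:
        print(line, end='')  # each line keeps its trailing newline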
