
Python readlines() usage and efficient practice for reading

I have a problem parsing thousands of text files (around 3000 lines in each file, ~400KB per file) in a folder. I read them using readlines:

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')

    file_content = f.readlines()
    f.close()

    len_file = len(file_content)
    i = 0
    while i < len_file:
        line = file_content[i].split(delimiter)
        ... my logic ...
        i += 1

This works completely fine for a sample of my inputs (50, 100 files). When I ran it on the whole input of more than 5K files, the time taken was nowhere close to a linear increase. I planned to do a performance analysis and did a cProfile analysis. The time taken keeps growing exponentially with more files, reaching its worst rates when the input reached 7K files.

Here is the cumulative time taken by readlines, first -> 354 files (a sample of the input) and second -> 7473 files (the whole input):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 354    0.192    0.001    **0.192**    0.001 {method 'readlines' of 'file' objects}
 7473 1329.380    0.178  **1329.380**    0.178 {method 'readlines' of 'file' objects}
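
For reference, a minimal sketch of how stats like these can be collected with cProfile and pstats; profile_me() is a hypothetical stand-in for the parsing loop above:

import cProfile
import pstats

def profile_me():
    # hypothetical stand-in for the file-parsing loop in the question
    for _ in range(1000000):
        pass

cProfile.run('profile_me()', 'parse.prof')   # write raw stats to a file
pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(10)   # top 10 by cumulative time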

Because of this, the time taken by my code does not scale linearly as the input grows. I read some doc notes on readlines(), where people claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

I agree with this point, but shouldn't the garbage collector automatically clear that loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of the file currently being processed? But there is some catch here. Can somebody give some insight into this issue?

Is this an inherent behavior of readlines(), or my misinterpretation of the Python garbage collector? Glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner. TIA.

The short version is: the efficient way to use readlines() is to not use it. Ever.


I read some doc notes on readlines(), where people claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, parses it into lines, and builds a list full of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory and builds a string, so that doesn't help.


On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.


You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in (see the sketch after the examples below).

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass

And this only reads one line at a time, although Python is allowed to (and will) pick a nice buffer size to make things faster:

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass

And this will do the exact same thing as the previous:

with open('foo') as f:
    for line in f:
        pass
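
And for option 3, here is a minimal sketch using mmap, assuming foo is a non-empty, uncompressed file; the mapped object can also be sliced and searched like one giant string without the file being read in eagerly:

import mmap

with open('foo', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # map the whole file read-only
    try:
        for line in iter(mm.readline, b''):  # readline returns b'' at end of file
            pass
    finally:
        mm.close()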

Meanwhile:

but shouldn't the garbage collector automatically clear that loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of the file currently being processed?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.

However, all those allocations, copies, and deallocations aren't free; it's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).


Putting it all together, here's how I'd write your program:

for filename in os.listdir(input_dir):
    with open(os.path.join(input_dir, filename), 'rb') as f:
        if filename.endswith(".gz"):
            f = gzip.GzipFile(fileobj=f)  # wrap the already-open file for decompression
        words = (line.split(delimiter) for line in f)
        ... my logic ...

Or, maybe:

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')
    with contextlib.closing(f):  # needs: import contextlib
        words = (line.split(delimiter) for line in f)
        ... my logic ...
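
In both versions, words is a lazy generator expression: nothing is read or split until the '... my logic ...' part iterates over it, so only about one line's worth of data needs to be held in memory at a time.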

Read line by line, not the whole file:

for line in open(file_name, 'rb'):
    pass  # process line here

Even better, use with for automatically closing the file:

with open(file_name, 'rb') as f:
    for line in f:
        pass  # process line here

The above will read the file object using an iterator, one line at a time.
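
To connect this back to the question, here is a minimal sketch of the same line-by-line pattern applied to the original loop; input_dir and delimiter are the names from the question, and the per-line work is left as a placeholder:

import gzip
import os

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(path, 'rb') as f:  # gzip file objects also support 'with' and line iteration
        for line in f:
            fields = line.split(delimiter)
            # per-line logic from the question goes here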
