
Read file line-by-line or store in memory?

Original post: 2014-07-07 08:38:43 | 9 2 | python, file, ram

This is less of a "my code's broken" question and more of a "should I do this?" question.

I have a script that iterates line by line using something like this: reader = csv.DictReader(open('file.txt', 'rb'), delimiter='\t') and gets things like ages and dates without committing the whole thing to memory.

As it stands, the script uses about 5% of my RAM (8GB).

In general, is it more accepted to put a file into memory instead of opening it and looping through its contents, especially if it's large (over 700MB)?

My script is for personal use, but I'd rather learn Python's conventions and do what's considered acceptable. For example, I know that if I were doing something similar in JavaScript I'd try to conserve memory as much as possible to prevent browsers from crashing or becoming unresponsive.

Is one method (memory vs. looping) preferred over the other in Python?

Edit: I'm aware this could be kind of broad. I'm more curious as to the best (Pythonic) practice.

There seem to be a lot of posts asking how to do it, but not a lot asking why or if.

2 answers

AFAIK, your method is the Pythonic way to do this.

You should be aware that open('file.txt') does not put the whole file into memory. It returns a file object, which is an iterator that reads the file on demand. Your DictReader works the same way: it parses one row at a time as you loop over it.
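As a minimal sketch of that on-demand behaviour, the snippet below wraps csv.DictReader in a generator. It is written for Python 3, where the csv module expects a text-mode file opened with newline='' rather than the 'rb' mode used in the question. The helper name iter_ages, the 'age' column, and the sample data are all made up for illustration.

```python
import csv
import io


def iter_ages(lines):
    """Yield the 'age' column as an int, one row at a time.

    `lines` is any iterable of text lines, such as an open file object.
    The 'age' column name is a hypothetical example.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:  # each iteration parses exactly one row
        yield int(row["age"])


# Stand-in for open('file.txt', newline=''); the contents are invented.
sample = io.StringIO("name\tage\nann\t31\nbob\t27\n")
ages = list(iter_ages(sample))
print(ages)
```

Calling list() here is only for the demonstration; in a real script you would loop over iter_ages(...) directly so that the rows are never all held at once.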

Just try processing a large file: you won't see any significant increase in memory consumption.
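One way to try this yourself is with the standard tracemalloc module. The sketch below is only a rough comparison under stated assumptions: it writes a throwaway file of arbitrary size, then measures the peak Python-level allocations while counting lines by streaming versus by loading every line into a list. The function names are hypothetical, and the exact numbers will vary by platform and Python version.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Return the peak traced allocation (in bytes) while func(path) runs."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_streaming(path):
    # Iterating the file object keeps only the current line and a small buffer.
    with open(path, "r") as f:
        return sum(1 for _ in f)


def count_loaded(path):
    # readlines() builds a list holding every line before counting.
    with open(path, "r") as f:
        return len(f.readlines())


# Build a throwaway file of roughly 2 MB; size and contents are arbitrary.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(50_000):
        tmp.write(f"{i}\tsome filler text for line {i}\n")
    path = tmp.name

streaming_peak = peak_bytes(count_streaming, path)
loaded_peak = peak_bytes(count_loaded, path)
os.remove(path)

print(f"streaming peak: {streaming_peak} bytes")
print(f"loaded peak:    {loaded_peak} bytes")
```

tracemalloc only sees allocations made through Python's allocator, so treat the figures as an indication of the difference rather than the process's true memory footprint.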

Most of the time, it's better to process the file as you read it. The operating system expects this access pattern, so it reads ahead a bit to compensate for the latency of the disk system. Loading the file in its entirety will normally reserve that memory for your process alone, which is wasteful if you're only scanning through it once. You could mmap it, which lets the system use its disk buffers directly, but then the system loses the hint of where you will be reading next. Reading in chunks that are too small lets system call overhead dominate, so you'll want to read fairly large chunks if possible, but for most programs the default buffering used when reading lines is sufficient.
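If you do want to control the chunk size yourself, a sketch along these lines reads a binary stream in fixed-size pieces so that only one chunk is in memory at a time. The function name count_newlines and the 1 MiB default are assumptions chosen for illustration, not a recommendation from the answer above.

```python
import io


def count_newlines(stream, chunk_size=1024 * 1024):
    """Count newline bytes by reading `stream` in fixed-size chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of file
            break
        total += chunk.count(b"\n")
    return total


# Stand-in for open('file.txt', 'rb'); a tiny chunk size is used here only
# to exercise the loop more than once.
data = io.BytesIO(b"a\nb\nc\n")
newline_count = count_newlines(data, chunk_size=4)
print(newline_count)
```

For line-oriented work such as the DictReader loop in the question, plain iteration over the file object already uses a reasonable buffer, so explicit chunking like this is rarely needed.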