
Efficiently write gigabytes of data to disk in Python

Originally posted 2014-03-30 20:18:49. Tags: python, performance

On Python v2.7 in Windows and Linux, what is the most efficient and quick way to sequentially write 5GB of data to a local disk (fixed or removable)? This data will not be read again soon and does not need to be cached.

It seems the normal ways of writing use the OS disk cache (because the system assumes it may re-read this data soon). This pushes useful data out of the cache, making the system slower.

Right now I am using f.write() with 65535 bytes of data at a time.
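
For reference, the chunked f.write() approach described above might look roughly like the sketch below. The function name write_sequential, the 64 KiB chunk size, and the zero-filled placeholder data are assumptions made for illustration; none of them come from the original post.

```python
CHUNK_SIZE = 64 * 1024  # 65536 bytes; an assumed power-of-two chunk size


def write_sequential(path, total_bytes, chunk_size=CHUNK_SIZE):
    """Write total_bytes of placeholder data to path in fixed-size chunks."""
    chunk = b"\0" * chunk_size  # stand-in for the real data source
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            # The final chunk may be shorter than chunk_size.
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            written += n
    return written
```

Calling write_sequential("out.bin", 5 * 1024 ** 3) would write 5 GiB this way; the writes still go through the OS cache like any ordinary file write.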

2 answers

The real reason your OS uses the disk cache isn't that it assumes the data will be re-read; it's that it wants to speed up the writes. You want to use the OS's write cache as aggressively as you possibly can.

That being said, the "standard" way to do high-performance, high-volume I/O in any language (and probably the most aggressive way to use the OS's read/write caches) is to use memory-mapped I/O. The mmap module ( https://docs.python.org/2/library/mmap.html ) provides that, and depending on how you generate your data in the first place, you might even gain more performance by writing it into the mapped buffer as soon as it is produced.

Note that with a dataset as big as yours, it'll only work on a 64-bit machine (on 32-bit, Python's mmap is limited to 4GiB buffers).
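
A minimal sketch of the memory-mapped approach, assuming the total size is known up front and the data arrives as an iterable of byte chunks. The function name write_with_mmap and its parameters are hypothetical; only the mmap module calls are from the standard library.

```python
import mmap


def write_with_mmap(path, chunks, total_size):
    """Copy an iterable of byte chunks into a pre-sized, memory-mapped file."""
    with open(path, "w+b") as f:
        # An empty file cannot be mapped, so size it first.
        f.truncate(total_size)
        mm = mmap.mmap(f.fileno(), total_size, access=mmap.ACCESS_WRITE)
        try:
            offset = 0
            for chunk in chunks:
                # Slice assignment copies the chunk into the mapping.
                mm[offset:offset + len(chunk)] = chunk
                offset += len(chunk)
            mm.flush()  # ask the OS to write dirty pages back to the file
        finally:
            mm.close()
    return offset
```

If the generator can fill the mapping directly, the intermediate chunk objects can be skipped entirely, which is the "dump it to the buffer earlier" idea mentioned above.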

If you want more specific advice, you'll have to give us more info on how you generate your data.

This answer is relevant for Windows code. I have no idea about the Linux equivalent, though I imagine the advice is similar.

If you want to write the fastest code possible, then write using the Win32 API and make sure you read the relevant section of the CreateFile documentation. Specifically, make sure you do not make the classic mistake of using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags to open a file; for more explanation, see Raymond Chen's classic blog post.

If you insist on writing at some multiple of the sector or cluster size, then don't be beholden to the magic number of 65535 (why this number? It's not a multiple of any real sector size). Instead, use GetDiskFreeSpace to figure out the appropriate sector size, though even this is no real guarantee (some data may be kept with the NTFS file information).
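
As a rough sketch, the sector size can be queried from Python through ctypes and used to round a chunk size up to a whole number of sectors. The helper names bytes_per_sector and round_up are hypothetical, and the non-Windows fallback to os.statvfs is an assumption of this sketch: it reports the filesystem block size, which is not necessarily the physical sector size.

```python
import ctypes
import os


def bytes_per_sector(root):
    """Return the sector size reported for the volume containing root."""
    if os.name == "nt":
        # DWORD is a 32-bit unsigned integer, which c_ulong is on Windows.
        sectors_per_cluster = ctypes.c_ulong(0)
        sector_size = ctypes.c_ulong(0)
        free_clusters = ctypes.c_ulong(0)
        total_clusters = ctypes.c_ulong(0)
        ok = ctypes.windll.kernel32.GetDiskFreeSpaceW(
            ctypes.c_wchar_p(root),
            ctypes.byref(sectors_per_cluster),
            ctypes.byref(sector_size),
            ctypes.byref(free_clusters),
            ctypes.byref(total_clusters),
        )
        if not ok:
            raise ctypes.WinError()
        return sector_size.value
    # Assumed fallback for other systems: filesystem block size.
    return os.statvfs(root).f_bsize


def round_up(n, multiple):
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple
```

With a 512-byte sector, round_up(65535, 512) gives 65536, which is one way to see that 65535 is one byte short of a sector-aligned size.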