简体繁体 English

保存大型 numpy 二维数组

[英]Saving large numpy 2d arrays

原文 2019-08-27 17:48:12 3 1 python/ numpy

I have an array with ~1,000,000 rows, each of which is a numpy array of 4,800 float32 numbers.我有一个包含约 1,000,000 行的数组，每行都是一个包含 4,800 个 float32 数字的 numpy 数组。 I need to save this as a csv file, however using numpy.savetxt has been running for 30 minutes and I don't know how much longer it will run for.我需要将它保存为 csv 文件，但是使用 numpy.savetxt 已经运行了 30 分钟，我不知道它会运行多长时间。 Is there a faster method of saving the large array as a csv?有没有更快的方法将大数组保存为 csv？ Many thanks, Josh非常感谢，乔希

1 个解决方案

As pointed out in the comments, 1e6 rows * 4800 columns * 4 bytes per float32 is 18GiB.正如评论中指出的那样，1e6 行 * 4800 列 * 每个 float32 4 个字节是 18GiB。 Writing a float to text takes ~9 bytes of text (estimating 1 for integer, 1 for decimal, 5 for mantissa and 2 for separator), which comes out to 40GiB.将浮点数写入文本需要大约 9 个字节的文本（估计 1 表示整数，1 表示小数，5 表示尾数，2 表示分隔符），结果为 40GiB。 This takes a long time to do, since just the conversion to text itself is non-trivial, and disk I/O will be a huge bottle-neck.这需要很长时间才能完成，因为仅转换为文本本身就很重要，而且磁盘 I/O 将是一个巨大的瓶颈。

One way to optimize this process may be to convert the entire array to text on your own terms, and write it in blocks using Python's binary I/O.优化此过程的一种方法可能是根据您自己的条件将整个数组转换为文本，并使用 Python 的二进制 I/O 将其写入块中。 I doubt that will give you too much benefit though.我怀疑这会给你带来太多好处。

A much better solution would be to write the binary data to a file instead of text.更好的解决方案是将二进制数据写入文件而不是文本。 Aside from the obvious advantages of space and speed, binary has the advantage of being searchable and not requiring transformation after loading.除了空间和速度的明显优势外，二进制还具有可搜索和加载后不需要转换的优势。 You know where every individual element is in the file, if you are clever, you can access portions of the file without loading the entire thing.您知道文件中每个单独元素的位置，如果您很聪明，您可以访问文件的部分内容而无需加载整个内容。 Finally, a binary file is more likely to be highly compressible than a relatively low-entropy text file.最后，二进制文件比相对低熵的文本文件更有可能是高度可压缩的。

Disadvantages of binary are that it is not human-readable, and not as portable as text.二进制的缺点是它不是人类可读的，并且不像文本那样可移植。 The latter is not a problem, since transforming into an acceptable format will be trivial.后者不是问题，因为转换为可接受的格式将是微不足道的。 The former is likely a non-issue given the amount of data you are attempting to process anyway.考虑到您无论如何要处理的数据量，前者可能不是问题。

Keep in mind that human readability is a relative term.请记住，人类可读性是一个相对术语。 A human can not read 40iGB of numerical data with understanding.人类无法通过理解读取 40iGB 的数值数据。 A human can process A) a graphical representation of the data, or B) scan through relatively small portions of the data.人类可以处理 A) 数据的图形表示，或 B) 扫描数据的相对较小部分。 Both cases are suitable for binary representations.这两种情况都适用于二进制表示。 Case A) is straightforward: load, transform and plot the data.案例 A) 很简单：加载、转换和绘制数据。 This will be much faster if the data is already in a binary format that you can pass directly to the analysis and plotting routines.如果数据已经是可以直接传递给分析和绘图程序的二进制格式，这会快得多。 Case B) can be handled with something like a memory mapped file.情况 B) 可以用内存映射文件之类的东西来处理。 You only ever need to load a small portion of the file, since you can't really show more than say a thousand elements on screen at one time anyway.您只需要加载文件的一小部分，因为无论如何您不能一次在屏幕上显示超过一千个元素。 Any reasonable modern platform should be able to keep upI/O and binary-to-text conversion associated with a user scrolling around a table widget or similar.任何合理的现代平台都应该能够保持与用户滚动表格小部件或类似物相关的 I/O 和二进制到文本的转换。 In fact, binary makes it easier since you know exactly where each element belongs in the file.事实上，二进制更容易，因为您确切地知道每个元素在文件中的位置。