
Compressing measurements data files

From measurements I get text files that basically contain a table of float numbers with dimensions 1000x1000. Those take up about 15 MB each which, considering that I get about 1000 result files in a series, is unacceptable to save. So I am trying to compress them as much as possible without loss of data. My idea is to group the numbers into ~1000 steps over the range I expect and save those. That would provide sufficient resolution. However, I still have 1,000,000 points to consider, so my resulting file is still about 4 MB, and I probably won't be able to compress that any further.

The bigger problem is the calculation time this takes. Right now I'd guesstimate 10-12 seconds per file, so about 3 hours for the 1000 files. Way too much. This is the algorithm I thought up; do you have any suggestions? There are probably far more efficient algorithms to do this, but I am not much of a programmer...

import numpy

# read the 1000x1000 table of floats
data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
out = numpy.empty((1000, 1000), numpy.int16)

vmin = -0.5          # renamed from min/max to avoid shadowing the built-ins
vmax = 0.5
step = (vmax - vmin) / 1000

# quantize each value into one of ~1000 bins of width `step`,
# clamping anything outside [vmin, vmax] to the edge bins
for i in range(1000):
    for j in range(1000):
        out[i, j] = data[i, j] // step
        if data[i, j] > vmax:
            out[i, j] = 500
        elif data[i, j] < vmin:
            out[i, j] = -500

numpy.savetxt('converted.txt', out, fmt="%i")

Thanks in advance for any hints you can provide! Jakob
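As a side note on the timing problem: the element-by-element double loop above is what makes the conversion slow. A hypothetical sketch of the same binning and clamping done with whole-array numpy operations (same bin indices, same edge clamping, no Python-level loop):

```python
import numpy

def quantize(data, vmin=-0.5, vmax=0.5, steps=1000):
    """Bin every element at once; clamp out-of-range values to the edge bins."""
    step = (vmax - vmin) / steps
    out = (data // step).astype(numpy.int16)   # floor-divide the whole array
    out[data > vmax] = steps // 2              # clamp high outliers to +500
    out[data < vmin] = -(steps // 2)           # clamp low outliers to -500
    return out

# stand-in for the table read with genfromtxt
data = numpy.random.uniform(-0.6, 0.6, (1000, 1000))
out = quantize(data)
```

Vectorized operations like these run in compiled code, so the per-file cost typically drops from seconds to milliseconds.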

I see you store the numpy arrays as text files. There is a faster and more space-efficient way: just dump it.

If your floats can be stored as 32-bit floats, then use this:

data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)

data.astype(numpy.float32).dump(open('converted.numpy', 'wb'))

then you can read it with:

data = numpy.load(open('converted.numpy', 'rb'))

The files will be 1000x1000x4 bytes, about 4 MB each.

The latest version of numpy supports 16-bit floats. Maybe your floats will fit in its limited range.
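A sketch of what that would look like: float16 halves the file again (1000x1000x2 bytes, about 2 MB) at the cost of roughly 3 significant decimal digits and a range of about ±65504, which may be enough for data confined to [-0.5, 0.5]. The file name here is an assumption.

```python
import numpy

# stand-in for the measured table
data = numpy.random.uniform(-0.5, 0.5, (1000, 1000))

# narrow to 16-bit floats and dump in numpy's binary format
small = data.astype(numpy.float16)
numpy.save('converted16.npy', small)        # 2 bytes per value

restored = numpy.load('converted16.npy')
# the round-off introduced by the narrower type
err = numpy.abs(restored.astype(numpy.float64) - data).max()
```

For values near 0.5 the float16 rounding error is on the order of 2.4e-4, comparable to the ~1000-step quantization proposed in the question.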

numpy.savez_compressed will let you save many arrays into a single compressed, binary file.

However, you aren't going to be able to compress it that much -- if you have 15 GB of data, you're not magically going to fit it into 200 MB with compression algorithms. You have to throw out some of your data, and only you can decide how much you need to keep.

Use the zipfile, bz2 or gzip module to save to a zip, bz2 or gz file from Python. Any compression scheme you write yourself in a reasonable amount of time will almost certainly be slower and have a worse compression ratio than these generic but optimized and compiled solutions. Also consider taking eumiro's advice.
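For instance, a sketch of wrapping numpy's binary dump in the standard-library gzip module (the file name is an assumption); numpy.save and numpy.load both accept an open file object:

```python
import gzip
import numpy

# stand-in for one measurement table, already narrowed to float32
data = numpy.random.uniform(-0.5, 0.5, (1000, 1000)).astype(numpy.float32)

# write the binary dump through a gzip stream
with gzip.open('converted.npy.gz', 'wb') as f:
    numpy.save(f, data)

# read it back the same way
with gzip.open('converted.npy.gz', 'rb') as f:
    restored = numpy.load(f)
```

How much this shrinks the 4 MB float32 dump depends entirely on the redundancy in the data; random floats barely compress, while smooth measurement fields often do.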
