
Efficient way of writing numpy arrays to file in Python

The data I process is ~6 million rows, and it takes a lot of time to write to a file. How do I improve it?

The following are the two approaches that I tried:

import numpy as np
import time

test_data = np.random.rand(6000000, 12)

# Approach 1: np.savetxt
T1 = time.time()
np.savetxt('test', test_data, fmt='%.4f', delimiter=' ')
T2 = time.time()
print("Time:", T2 - T1, "Sec")

# Approach 2: manual formatting, one value at a time
file3 = open('test2', 'w')
for i in range(6000000):
    for j in range(12):
        file3.write('%6.4f\t' % (test_data[i][j]))
    file3.write('\n')
file3.close()
T3 = time.time()
print("Time:", T3 - T2, "Sec")

Time: 56.6293179989 Sec

Time: 115.468323946 Sec

I am dealing with at least 100 files like this, so the total time adds up; please help. Also, I am not writing in .npy or a compressed format because I need to read the files in MATLAB for further processing.

Have you considered h5py?

Here's a cursory single-run time comparison:

>>> import numpy as np
>>> import time
>>> import h5py
>>> test_data = np.random.rand(6000000,12)
>>> file = h5py.File('arrays.h5', 'w')

>>> %time file.create_dataset('test_data', data=test_data, dtype=test_data.dtype)
CPU times: user 1.28 ms, sys: 224 ms, total: 225 ms
Wall time: 280 ms
<HDF5 dataset "test_data": shape (6000000, 12), type "<f8">

>>> %time np.savetxt('test',test_data, fmt='%.4f', delimiter=' ' )
CPU times: user 24.4 s, sys: 617 ms, total: 25 s
Wall time: 26.3 s

>>> file.close()
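For completeness, reading the dataset back is just as short. This is a minimal sketch, assuming the arrays.h5 file created above:

import h5py

# Reopen the file read-only; [:] pulls the whole dataset back into a numpy array
with h5py.File('arrays.h5', 'r') as f:
    restored = f['test_data'][:]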

save will almost always be hugely faster than savetxt. It just dumps the raw bytes, without having to format them as text. It also writes smaller files, which means less I/O. And you'll get equal benefits at load time: less I/O, and no text parsing.
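As a rough sketch (the filename is arbitrary, and this is not the exact benchmark used for the timings below), the save variant looks like this:

import numpy as np
import time

test_data = np.random.rand(6000000, 12)

T1 = time.time()
# Raw binary dump: just the array's bytes plus a small header, no text formatting
np.save('test.npy', test_data)
T2 = time.time()
print("Time:", T2 - T1, "Sec")

# Loading is equally cheap: the bytes are read straight back in, with no parsing
loaded = np.load('test.npy')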

Everything else below is basically a variant on top of the benefits of save. And if you look at the times at the end, all of them are within an order of magnitude of each other, but all around two orders of magnitude faster than savetxt. So you may just be happy with the 200:1 speedup and not care about trying to tweak things any further. But if you do need to optimize further, read on.


savez_compressed saves the array with DEFLATE compression. This means you waste a bunch of CPU, but save some I/O. If it's a slow disk that's slowing you down, that's a win. Note that with smallish arrays, the constant overhead will probably hurt more than the compression speedup will help, and if you have a random array there's little to no compression possible.
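A minimal sketch of that (filename and key are arbitrary):

import numpy as np

test_data = np.random.rand(1000000, 12)

# One DEFLATE-compressed archive; the array is stored under the key 'test_data'
np.savez_compressed('test_compressed.npz', test_data=test_data)

# Loading returns an archive object indexed by the keys used when saving
with np.load('test_compressed.npz') as archive:
    restored = archive['test_data']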

savez_compressed is also a multi-array save. That may seem unnecessary here, but if you chunk a huge array into, say, 20 smaller ones, this can sometimes go significantly faster. (Even though I'm not sure why.) The cost is that if you just load up the .npz and stack the arrays back together, you don't get contiguous storage, so if that matters, you have to write more complicated code.
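If you want to try the chunked variant, here's a sketch (20 chunks is an arbitrary choice; the concatenate on load is the extra copy just mentioned):

import numpy as np

test_data = np.random.rand(1000000, 12)

# Split into 20 row-wise chunks and save them all into a single .npz archive
chunks = np.array_split(test_data, 20)
np.savez_compressed('test_chunks.npz',
                    **{'chunk_%02d' % i: c for i, c in enumerate(chunks)})

# The chunks come back as separate arrays; concatenating copies them into one
# new contiguous array, which is the "more complicated code" referred to above
with np.load('test_chunks.npz') as archive:
    restored = np.concatenate([archive['chunk_%02d' % i]
                               for i in range(len(chunks))])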

Notice that my test below uses a random array, so the compression is just wasted overhead. But testing against zeros or arange would be just as misleading in the opposite direction, so… this is something to test on your real data.

Also, I'm on a computer with a pretty fast SSD, so the tradeoff between CPU and I/O may not be as imbalanced as it is on whatever machine you're running on.


numpy.memmap, or an array allocated into a stdlib mmap.mmap, is backed to disk with a write-through cache. This shouldn't reduce the total I/O time, but it means that the I/O doesn't happen all at once at the end; instead it's spread throughout your computation, which often means it can happen in parallel with your heavy CPU work. So, instead of spending 50 minutes calculating and then 10 minutes saving, you spend 55 minutes calculating-and-saving.
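A sketch of the idea (shape, chunk size, and filename are all illustrative):

import numpy as np

# A disk-backed array: writes go out to 'test_memmap.dat' through the OS page
# cache, spread over the computation instead of in one big dump at the end
mm = np.memmap('test_memmap.dat', dtype='float64', mode='w+', shape=(1000000, 12))

# Fill it piece by piece, the way a long-running computation would
for start in range(0, 1000000, 100000):
    mm[start:start + 100000] = np.random.rand(100000, 12)

mm.flush()  # make sure everything has been written out
del mm      # release the mapping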

This one is hard to test in any sensible way with a program that isn't actually doing any computation, so I didn't bother.


pickle, or one of its alternatives like dill or cloudpickle. There's really no good reason a pickle should be faster than a raw array dump, but occasionally it seems to be.

For a simple contiguous array like the one in my tests, the pickle is just a small wrapper around the exact same bytes as the binary dump, so it's just pure overhead.
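For reference, a minimal pickle sketch (the filename is arbitrary; see also the separate pickle answer further down):

import numpy as np
import pickle

test_data = np.random.rand(1000000, 12)

# The file must be opened in binary mode; HIGHEST_PROTOCOL keeps the dump compact
with open('test.pickle', 'wb') as f:
    pickle.dump(test_data, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('test.pickle', 'rb') as f:
    restored = pickle.load(f)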


For comparison, here's how I'm testing each one:

In [70]: test_data = np.random.rand(1000000,12)
In [71]: %timeit np.savetxt('testfile', test_data)
9.95 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: os.stat('testfile').st_size
Out[74]: 300000000

Notice the use of %timeit there. If you're not using IPython, use the timeit module in the stdlib to do the same thing a little more verbosely. Testing with time has all kinds of problems (as described in the timeit docs), but the biggest is that you're only doing a single rep. And for I/O-based benchmarks, that's especially bad.
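If you're not in IPython, the stdlib equivalent looks roughly like this (the statement being timed and the repeat counts are just examples):

import timeit
import numpy as np

test_data = np.random.rand(1000000, 12)

# Run several timed repetitions of the same statement instead of a single run
times = timeit.repeat("np.save('testfile.npy', test_data)",
                      setup="import numpy as np; from __main__ import test_data",
                      repeat=3, number=1)
print(min(times), "sec (best of 3)")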


Here are the results for each, but given the caveats above, you should really only consider the first two meaningful.

  • savetxt: 9.95 s, 300 MB
  • save: 45.8 ms, 96 MB
  • savez_compressed: 360 ms, 90 MB
  • pickle: 287 ms, 96 MB

How about using pickle? I found that it is faster.

import numpy as np
import time
import pickle

test_data = np.random.rand(1000000, 12)

# Approach 1: np.savetxt
T1 = time.time()
np.savetxt('testfile', test_data, fmt='%.4f', delimiter=' ')
T2 = time.time()
print("Time:", T2 - T1, "Sec")

# Approach 2: manual formatting loop
file3 = open('testfile', 'w')
for i in range(test_data.shape[0]):
    for j in range(test_data.shape[1]):
        file3.write('%6.4f\t' % (test_data[i][j]))
    file3.write('\n')
file3.close()
T3 = time.time()
print("Time:", T3 - T2, "Sec")

# Approach 3: pickle (binary dump)
file3 = open('testfile', 'wb')
pickle.dump(test_data, file3)
file3.close()
T4 = time.time()
print("Time:", T4 - T3, "Sec")

# load the pickled data back
file4 = open('testfile', 'rb')
obj = pickle.load(file4)
file4.close()
print(obj)

The output is:

Time: 9.1367928981781 Sec
Time: 16.366491079330444 Sec
Time: 0.41736602783203125 Sec
