
Best way to store numpy arrays in ASCII files

I often end up with processed numpy arrays that are the result of lengthy computations, and I need to reuse them in other calculations. I currently pickle them and unpickle the files into variables as and when I need them.

I noticed that for large data sizes (~1M data points) this is slow. I have read elsewhere that pickling is not the best way to store huge files. I would like to store them as ASCII files and read them back efficiently, loading directly into a numpy array. What is the best way to do this?

Say I have a 100k x 3 2D array in a variable 'a'. I want to store it in an ASCII file and load it into a numpy array variable 'b'.

If you want efficiency, ASCII is not the way to go. The problem with pickle is that it depends on the Python version, so it is not a good idea for long-term storage. You can try other binary formats instead; the most straightforward solution is numpy.save, as described in the NumPy documentation.
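If a plain-text file really is required, a minimal sketch with np.savetxt and np.loadtxt could look like the following. The file name a.txt and the random stand-in data are assumptions for illustration; 'a' and 'b' follow the naming in the question.

```python
import numpy as np

# Stand-in for the question's 100k x 3 array (random data, illustration only).
rng = np.random.default_rng(0)
a = rng.random((100_000, 3))

# Write as plain text, one row per line. '%.18e' is savetxt's default format
# and keeps enough digits for float64 values.
np.savetxt('a.txt', a, fmt='%.18e')

# Read the text file back into a new array.
b = np.loadtxt('a.txt')

print(b.shape)            # (100000, 3)
print(np.allclose(a, b))  # True
```

Expect this to be noticeably slower and to produce a larger file than a binary format, since every value is formatted and parsed as text.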

Numpy has a range of input and output methods that will do exactly what you are after.

One option would be numpy.save:

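import numpy as np

my_array = np.array([1, 2, 3, 4])
# Open in binary mode: np.save writes NumPy's binary .npy format, not text,
# even though the file is named data.txt here.
with open('data.txt', 'wb') as f:
    np.save(f, my_array, allow_pickle=False)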

To load your data again:

# Read the array back, again in binary mode.
with open('data.txt', 'rb') as f:
    my_loaded_array = np.load(f)
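Applied to the 100k x 3 case from the question, the same two calls can also take a file name instead of an open file object. A sketch, where the file name a.npy and the random stand-in data are assumptions:

```python
import numpy as np

# Stand-in for the question's array (random data, illustration only).
a = np.random.default_rng(0).random((100_000, 3))

# Given a path, np.save writes NumPy's binary .npy format
# (it appends '.npy' if the name lacks that extension).
np.save('a.npy', a, allow_pickle=False)

# Load it back; shape and dtype are restored from the file header.
b = np.load('a.npy')

print(b.shape, b.dtype)      # (100000, 3) float64
print(np.array_equal(a, b))  # True
```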

The problem you pose is directly related to the size of the dataset.

Specialized libraries offer several solutions to this quite common problem.

  1. Python-only persistence: joblib offers an alternative to pickle, aimed specifically at objects that are too large to pickle conveniently.
  2. HDF5 is a file format designed specifically for storing arrays. The format is multi-language and multi-platform, and a very good Python library exists for it: h5py
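For the joblib option, a minimal sketch could look like this. It assumes joblib is installed; the file name a.joblib and the random stand-in data are made up for illustration.

```python
import numpy as np
import joblib

# Stand-in for the question's array (random data, illustration only).
a = np.random.default_rng(0).random((100_000, 3))

# Persist the array to disk, then load it back.
joblib.dump(a, 'a.joblib')
b = joblib.load('a.joblib')

print(np.array_equal(a, b))  # True
```

Like pickle, this ties the file to Python, which is why it is listed as Python-only persistence.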

An example with h5py. To write the data:

import h5py
# 'a' is the numpy array from the question.
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('a', data=a)

To read the data:

import h5py
with h5py.File('data.h5', 'r') as f:
    # [:] reads the whole dataset into an in-memory numpy array.
    b = f['a'][:]
