
Python - Reading HDF5 dataset into a list vs numpy array

I've been exploring HDF5 and its Python interface (h5py), so I tried to read an HDF5 file (a one-dimensional array of 100 million integers) into a normal list, and separately into a numpy array. Converting the dataset to a numpy array was very fast compared to converting it to a normal Python list (in fact, doing it with a list took so long that I had to kill the process before it finished).

Can anyone help me understand what happens internally that makes converting an HDF5 dataset to a numpy array so much faster than converting it to a normal list? Does it have to do with h5py's compatibility with numpy?

import numpy as np
import h5py

def readtolist(dataset):
    # Iterates element by element, creating a Python object for each entry
    return "normal list count = {0}".format(len(list(dataset)))

def readtonp(dataset):
    # Reads the whole dataset into a contiguous numpy array in one call
    n1 = np.array(dataset)
    return "numpy count = {0}".format(len(n1))

f = h5py.File(path, 'r')
readtolist(f['1'])
readtonp(f['1'])

Thanks for the help!

Using a test file that I recently created:

In [78]: f = h5py.File('test.h5')
In [79]: list(f.keys())
Out[79]: ['x']
In [80]: f['x']
Out[80]: <HDF5 dataset "x": shape (2, 5), type "<i8">
In [81]: x = f['x'][:]
In [82]: x
Out[82]: 
array([[0, 2, 4, 6, 8],
       [1, 3, 5, 7, 9]])
In [83]: alist = x.tolist()
In [84]: alist
Out[84]: [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]

Data storage in HDF5 is similar to numpy arrays. h5py uses compiled code (Cython) to interface with the HDF5 base code. It loads datasets as numpy arrays.

To get a list, then, you have to convert the array to a list. For a 1d array, list(x) sort of works, but it is slow and incomplete. tolist() is the correct way.

list() iterates over the first dimension of the array:

In [85]: list(x)
Out[85]: [array([0, 2, 4, 6, 8]), array([1, 3, 5, 7, 9])]
In [86]: list(f['x'])
Out[86]: [array([0, 2, 4, 6, 8]), array([1, 3, 5, 7, 9])]
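To make the cost concrete, here is a small illustrative benchmark (the array size and repeat count are arbitrary choices, not from the original post) comparing list(arr), which wraps each element in a numpy scalar object, with arr.tolist(), which converts the whole array to native Python ints in one compiled call:

```python
import timeit

import numpy as np

# Illustrative benchmark: list(arr) iterates in Python, yielding one
# numpy scalar object per element; arr.tolist() does the conversion
# to plain Python ints in a single compiled pass.
arr = np.arange(1_000_000)

t_list = timeit.timeit(lambda: list(arr), number=5)
t_tolist = timeit.timeit(lambda: arr.tolist(), number=5)

print(f"list(arr):    {t_list:.3f} s")
print(f"arr.tolist(): {t_tolist:.3f} s")

# The element types also differ:
print(type(list(arr)[0]))      # a numpy scalar (e.g. numpy.int64)
print(type(arr.tolist()[0]))   # a plain Python int
```

On a typical machine tolist() is several times faster, and it also hands you genuine Python ints rather than numpy scalars.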

1211:~/mypy$ h5dump test.h5
HDF5 "test.h5" {
GROUP "/" {
   DATASET "x" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 2, 5 ) / ( 2, 5 ) }
      DATA {
      (0,0): 0, 2, 4, 6, 8,
      (1,0): 1, 3, 5, 7, 9
      }
   }
}
}

I should add that a Python list is a unique data structure. It contains pointers to objects elsewhere in memory, and thus can hold all kinds of objects: numbers, other lists, dictionaries, strings, custom classes, etc. An HDF5 dataset, like a numpy array, has to have a uniform data type (the DATATYPE in the dump). It can't, for example, store an object dtype array. If you want to save a list to HDF5, you first have to convert it to an array.
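As a quick sketch of that constraint, using plain numpy (the same rule applies before anything reaches HDF5; the example values are made up):

```python
import numpy as np

# A Python list may mix types freely; a numpy array (and hence an
# HDF5 dataset) needs one uniform dtype.
mixed = [1, "two", [3, 4]]               # heterogeneous: fine as a list
arr = np.array(mixed, dtype=object)      # numpy falls back to object dtype
print(arr.dtype)                         # object - h5py cannot store this directly

uniform = np.array([1, 2, 3])            # homogeneous -> a real numeric array
print(uniform.dtype)                     # a fixed-width integer type, e.g. int64
```

An object-dtype array is really just a list in array clothing: it still holds pointers to Python objects, which is exactly what HDF5 cannot serialize as a uniform block.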

HDF5 is a file format intended for storing large quantities of scientific array data. It can store multiple datasets, and it offers multiple on-the-fly compression modes, enabling data with repeated patterns to be stored more efficiently.

Usually, parsing it with pandas or numpy is much faster, since they handle that compression in a vectorized form, while a native Python list handles the data via nested Python objects, which is significantly slower.

This is basically it in a nutshell.

The following is an experiment using pandas (with numpy under the hood) to generate 100,000 entries, store them in HDF5, and then parse the file again.

Generating the Data

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(100000)})
# to_hdf opens and closes the file itself; no separate HDFStore is needed
frame.to_hdf('mydata.h5', 'obj1', format='table')

Timing the Parsing

%%timeit
df = pd.read_hdf('mydata.h5', 'obj1')

Output

9.14 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Very speedy compared to list.

I'm not sure the difference you are seeing is specific to hdf5. A similar effect can be had using just a string as a data source:

>>> import numpy as np
>>> import string, random
>>> 
>>> from timeit import timeit    
>>> kwds = dict(globals=globals(), number=1000)
>>> 
>>> a = ''.join(random.choice(string.ascii_letters) for _ in range(1000000))
>>> 
>>> timeit("A = np.fromstring(a, dtype='S1')", **kwds)
0.06803569197654724
>>> timeit("L = list(a)", **kwds)
6.131339570041746

Simplifying slightly: for homogeneous data, numpy stores the values 'as they are', in a single block of memory, whereas a list creates a Python object for each item. A Python object in this case consists of the value, a pointer to its type object, and a reference count. So, in summary, numpy can essentially just copy a block of memory, whereas a list has to allocate and create all those objects plus the list container.
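A rough, illustrative way to see the storage difference (the sizes below are CPython-specific approximations, and the element count is arbitrary):

```python
import sys

import numpy as np

# An int64 array stores 8 bytes per element in one contiguous block.
# A list stores one pointer per element in its container, and each
# element is a full Python int object (~28 bytes in CPython) living
# elsewhere in memory.
n = 100_000
arr = np.arange(n)
lst = arr.tolist()

array_bytes = arr.nbytes  # just the raw data block
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)

print(f"numpy array: {array_bytes:,} bytes")
print(f"python list: {list_bytes:,} bytes (container + int objects)")
```

The list comes out several times larger, and every one of those int objects also had to be allocated individually, which is where the conversion time goes.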

The flip side of this is that accessing individual elements is faster from lists because, amongst other things, the array now has to create a Python object for each access, whereas the list can simply return the one it has stored:

>>> L = list(a)
>>> A = np.fromstring(a, dtype='S1')
>>> 
>>> kwds = dict(globals=globals(), number=100)
>>> 
>>> timeit("[L[i] for i in range(len(L))]", **kwds)
5.607562301913276
>>> timeit("[A[i] for i in range(len(A))]", **kwds)
13.343806453049183
