
Read from a large file without loading the whole thing into memory using h5py

Does the following read from a dataset without loading the entire thing into memory at once (the whole thing will not fit into memory), and get the size of the dataset without loading the data, using h5py in Python? If not, how?

h5 = h5py.File('myfile.h5', 'r')
mydata = h5.get('matrix') # does h5.get load all the data into memory?
part_of_mydata= mydata[1000:11000,:]
size_data =  mydata.shape 

Thanks.

get (or indexing) fetches a reference to the Dataset in the file, but does not load any data.

In [789]: list(f.keys())
Out[789]: ['dset', 'dset1', 'vset']
In [790]: d=f['dset1']
In [791]: d
Out[791]: <HDF5 dataset "dset1": shape (2, 3, 10), type "<f8">
In [792]: d.shape         # shape of dataset
Out[792]: (2, 3, 10)
In [793]: arr=d[:,:,:5]    # indexing the set fetches part of the data
In [794]: arr.shape
Out[794]: (2, 3, 5)
In [795]: type(d)
Out[795]: h5py._hl.dataset.Dataset
In [796]: type(arr)
Out[796]: numpy.ndarray

d, the Dataset, is array-like, but not actually a numpy array.

Fetch the whole Dataset with:

In [798]: arr = d[:]
In [799]: type(arr)
Out[799]: numpy.ndarray

Exactly how much of the file it has to read to fetch your slice depends on the slicing, the data layout, chunking, and other things that are generally not under your control and shouldn't worry you.

Note also that reading one dataset does not load the others. The same applies to groups.
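To make the pattern above concrete, here is a minimal self-contained sketch (the file name and dataset names are made up for illustration): it creates an HDF5 file with two datasets, then reopens it and reads only a slice of one dataset, never materializing the whole thing.

```python
import h5py
import numpy as np

# Hypothetical file name, used only for this demo.
fname = 'demo.h5'

# Write a file with two datasets, so we can see that touching one
# dataset does not involve the other.
with h5py.File(fname, 'w') as f:
    f.create_dataset('matrix', data=np.arange(20000.0).reshape(2000, 10))
    f.create_dataset('other', data=np.zeros((5, 5)))

with h5py.File(fname, 'r') as f:
    dset = f['matrix']            # a Dataset reference; no data read yet
    size = dset.shape             # metadata only: (2000, 10)
    part = dset[1000:1100, :]     # reads just these 100 rows from disk
    # part is now an in-memory numpy.ndarray of shape (100, 10);
    # 'other' was never loaded at all.
```

The slice `dset[1000:1100, :]` is the only point at which actual data is transferred from disk into memory.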

http://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data
