H5Py and storage

Question

I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large, in fact, that I cannot load it all into memory at once. But I can calculate the array in chunks, i.e. my code looks something like:

for i in np.arange(numberOfChunks):

   myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation

As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time, i.e. I want to do something like this:

for i in np.arange(numberOfChunks):

   myArrayChunk = #... do some calculation to obtain chunk

   saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)

I understand this can be done with h5py, but I am a little confused about how to do this. My current understanding is that I can do this:

import h5py

# Make the file
h5py_file = h5py.File(filename, "a")

# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")


for i in np.arange(numberOfChunks):

   myArrayChunk = #... do some calculation to obtain chunk

   myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk

But this is where I become a little confused. I have read that if you index a h5py datatype like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory? How has this saved any memory?

Similarly, later on, I would like to read my file back in one chunk at a time, doing further calculation, i.e. I would like to do something like:

import h5py

# Read in the file
h5py_file = h5py.File(filename, "a")

# Read in myArray
myArray = h5py_file['myArray']

for i in np.arange(numberOfChunks):

   # Read in chunk
   myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]

   # ... Do some calculation on myArrayChunk

But by the end of this loop, is the whole of myArray now in memory? I am a little confused about when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Please could someone explain this.

Answer 1

You have the basic idea. Take care when saying "save to memory". NumPy arrays are held in memory (RAM). HDF5 data is saved on disk (not in memory/RAM) and then accessed; the memory used depends on how you access it. In the first step you are creating the data and writing it to disk in chunks. In the second step you are reading the data back from disk in chunks. A working example is provided at the end.

When reading data with h5py there are 2 ways to access the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns an h5py dataset object that operates like a NumPy array:
myArrayDS = myArray

The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Each slice is read from disk into a new NumPy array, so only the chunk you are currently holding occupies memory; once the loop variable is rebound to the next chunk, the previous one can be freed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
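
To make the distinction concrete, here is a minimal sketch that checks what type each kind of access returns. The file name demo.h5, the array shape, and the variable names are made up for this illustration; the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name and shapes, chosen only for this illustration.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.h5")

    # Write a small 6x2x2 dataset so there is something to read back.
    with h5py.File(path, "w") as f:
        f.create_dataset("myArray", data=np.arange(24, dtype="f8").reshape(6, 2, 2))

    with h5py.File(path, "r") as f:
        ds = f["myArray"]        # dataset object: a handle to the data on disk
        chunk = ds[0:2, :, :]    # slicing reads just these rows into a NumPy array

        ds_is_dataset = isinstance(ds, h5py.Dataset)
        chunk_is_ndarray = isinstance(chunk, np.ndarray)
        chunk_shape = chunk.shape
        chunk_nbytes = chunk.nbytes  # 2*2*2 float64 values = 64 bytes

print(ds_is_dataset, chunk_is_ndarray, chunk_shape, chunk_nbytes)
```

Only the sliced rows are held as a NumPy array here; the rest of the dataset stays on disk until you ask for it.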

My example also corrects 1 small error in your chunk-size increment equation. You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
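
A quick way to see why the stop index matters is to print the (start, stop) pairs that each formula produces. This is only a sketch in plain Python; the values chunkSize = 4 and numberOfChunks = 3 are arbitrary.

```python
chunkSize = 4
numberOfChunks = 3

# Bounds as written in the question: stop = i*(chunkSize+1)
wrong = [(i * chunkSize, i * (chunkSize + 1)) for i in range(numberOfChunks)]

# Corrected bounds: stop = (i+1)*chunkSize
right = [(i * chunkSize, (i + 1) * chunkSize) for i in range(numberOfChunks)]

print(wrong)  # [(0, 0), (4, 5), (8, 10)]
print(right)  # [(0, 4), (4, 8), (8, 12)]
```

With the original formula the slices cover 0, 1, and 2 rows instead of 4 rows each, so the shape of the assigned chunk would not match the selection.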

Working Example (writes and reads):

import h5py
import numpy as np

# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:

    numberOfChunks = 3
    chunkSize = 4
    print( 'WRITING %d chunks w/ chunkSize=%d ' % (numberOfChunks,chunkSize) )
    # Write dataset to disk
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize,2,2), compression="gzip")

    for i in range(numberOfChunks):

       h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize,2,2)
       print (h5ArrayChunk)

       h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk


with h5py.File("SO_61173314.h5", "r") as h5r:
    print( '\nREADING %d chunks w/ chunkSize=%d\n' % (numberOfChunks,chunkSize) )

    # Access myArray dataset - Note: This is NOT a NumPy array
    myArray = h5r['myArray']

    for i in range(numberOfChunks):

       # Read a chunk into memory (as a NumPy array)
       myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]

       # ... Do some calculation on myArrayChunk  
       print (myArrayChunk)
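
The example above assumes the number of rows is an exact multiple of chunkSize. As a hedged variation (not part of the original answer), the sketch below steps through the dataset by start index and clamps the stop index, so a shorter final chunk is handled too. It also keeps a running total to stand in for "some calculation". The file name ragged_demo.h5 and the sizes are hypothetical, and the small test array is built in memory only to have something to write.

```python
import os
import tempfile

import h5py
import numpy as np

chunkSize = 4
nRows = 10  # deliberately not a multiple of chunkSize

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "ragged_demo.h5")  # hypothetical file name

    # Small demo data; a real workload would write this chunk by chunk instead.
    data = np.arange(nRows * 2 * 2, dtype="f8").reshape(nRows, 2, 2)
    with h5py.File(path, "w") as h5w:
        h5w.create_dataset("myArray", data=data, compression="gzip")

    total = 0.0
    rows_read = []
    with h5py.File(path, "r") as h5r:
        myArray = h5r["myArray"]  # dataset object, nothing bulk-read yet
        for start in range(0, myArray.shape[0], chunkSize):
            # Clamp the stop index so the last chunk may be shorter.
            stop = min(start + chunkSize, myArray.shape[0])
            myArrayChunk = myArray[start:stop, :, :]  # one chunk as a NumPy array
            rows_read.append(myArrayChunk.shape[0])
            total += float(myArrayChunk.sum())

expected_total = float(data.sum())
print(rows_read, total == expected_total)
```

Because myArrayChunk is rebound on every iteration, at most one chunk's worth of data is referenced by the loop at any time.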
