Size on disk of a partly filled HDF5 dataset

I'm reading the book Python and HDF5 (O'Reilly) which has a section on empty datasets and the size they take on disk:

import numpy as np
import h5py

f = h5py.File("testfile.hdf5", "w")  # explicit mode; newer h5py versions default to read-only
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)
f.flush()
# Size on disk is 1KB

dset[0:1024] = np.arange(1024)
f.flush()
# Size on disk is 4GB

After filling part (first 1024 entries) of the dataset with values, I expected the file to grow, but not to 4GB. It's essentially the same size as when I do:

dset[...] = np.arange(1024**3)

The book states that the file size on disk should be around 66KB. Could anyone explain what the reason is for the sudden size increase?

Version info:

  • Python 3.6.1 (OSX)
  • h5py 2.7.0

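If you open your file in HDFView you can see that chunking is off. This means that the array is stored as one contiguous block in the file and cannot be resized. HDF5 reserves that whole block as soon as any part of the dataset is written, which is why the file stays at about 1 KB until the first write and then jumps to the full 4 GB, even though only 1024 elements hold data.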

If you create your dataset with chunking enabled, the dataset is divided up into regularly-sized pieces which are stored haphazardly on disk and indexed using a B-tree. In that case only the chunks that contain at least one written element are allocated on disk. If you create your dataset as follows, the file will be much smaller:

dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)
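To see the difference without filling up the disk, here is a minimal sketch that repeats the experiment on a scaled-down dataset (16 MiB instead of 4 GB). The helper name size_after_partial_write and the temporary file location are made up for illustration; the exact byte counts depend on your HDF5 version, so none are shown.

```python
import os
import tempfile

import h5py
import numpy as np

N = 4 * 1024**2  # 4 Mi float32 values = 16 MiB, scaled down from the question's 4 GB


def size_after_partial_write(chunks):
    """Create an N-element float32 dataset, write 1024 values, return the file size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "testfile.hdf5")
        with h5py.File(path, "w") as f:
            dset = f.create_dataset("big dataset", (N,), dtype=np.float32, chunks=chunks)
            dset[0:1024] = np.arange(1024)
        # Measure after the file is closed so everything has been flushed.
        return os.path.getsize(path)


contiguous_bytes = size_after_partial_write(chunks=None)   # default layout, chunking off
chunked_bytes = size_after_partial_write(chunks=(2**14,))  # 64 KiB chunks

print("contiguous:", contiguous_bytes, "bytes")
print("chunked:   ", chunked_bytes, "bytes")
```

The contiguous file should come out at roughly the full dataset size, while the chunked file should only hold the single chunk that was written plus some metadata.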

Passing chunks=True lets h5py determine the size of the chunks automatically. You can also set the chunk size explicitly. For example, to set it to 16384 floats (= 64 KiB, since each float32 takes 4 bytes), use:

dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=(2**14,))
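The arithmetic behind that chunk size can be checked in a few lines. This is only a back-of-the-envelope sketch: chunks_touched is a hypothetical helper, and the result ignores the B-tree and header overhead that HDF5 adds on top.

```python
import math

ITEMSIZE = 4        # bytes per np.float32 value
N = 1024**3         # dataset length from the question
CHUNK_LEN = 2**14   # chunk length used above

chunk_bytes = CHUNK_LEN * ITEMSIZE
total_chunks = math.ceil(N / CHUNK_LEN)


def chunks_touched(start, stop, chunk_len=CHUNK_LEN):
    """Number of chunks overlapped by the half-open slice [start, stop)."""
    if stop <= start:
        return 0
    return (stop - 1) // chunk_len - start // chunk_len + 1


# Writing dset[0:1024] falls entirely inside the first chunk.
allocated = chunks_touched(0, 1024) * chunk_bytes

print(chunk_bytes)   # 65536 bytes = 64 KiB per chunk
print(total_chunks)  # 65536 chunks if the dataset were completely filled
print(allocated)     # 65536 bytes of data allocated after the partial write
```

So after writing the first 1024 entries, only one 64 KiB chunk needs to exist on disk instead of the whole 4 GB.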

The best chunk size depends on the reading and writing patterns of your applications. Note that:

Chunking has performance implications. It's recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

See http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage
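If you let h5py pick the chunk shape, you can check what it chose through the dataset's chunks attribute and compare it with the guideline quoted above. This is a rough sketch; the temporary file name is arbitrary, and the shape h5py guesses may differ between versions, so no output is shown.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks_demo.hdf5")
    with h5py.File(path, "w") as f:
        # Nothing is written, so no chunks are allocated for this dataset.
        dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)
        auto_chunks = dset.chunks  # tuple with the chunk shape h5py selected
        chunk_nbytes = int(np.prod(auto_chunks)) * dset.dtype.itemsize

print("chunk shape chosen by h5py:", auto_chunks)
print("bytes per chunk:", chunk_nbytes)
print("within 10 KiB to 1 MiB:", 10 * 1024 <= chunk_nbytes <= 1024**2)
```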
