
Is the byte offset in files produced by np.save always 128?

I've been playing with numpy memmaps, and I noticed that if I generated some data and dumped it to disk like so:

from os.path import join
import numpy as np
import tempfile

print('Generate dummy data')
N = 4
D = 3
x, y = np.meshgrid(np.arange(0, N), np.arange(0, N))
data = np.ascontiguousarray((np.dstack([x] * D) % 256).astype(np.uint8))

print('Make temp directory')
dpath = tempfile.mkdtemp()
mem_fpath = join(dpath, 'foo.npy')

print('Dump memmap')
np.save(mem_fpath, data)

then the data read back by np.memmap and np.load differed.

file1 = np.memmap(mem_fpath, dtype=data.dtype.name, shape=data.shape,
                  mode='r')
file2 = np.load(mem_fpath)
print('file1 =\n{!r}'.format(file1[0]))
print('file2 =\n{!r}'.format(file2[0]))

resulting in

file1 =
memmap([[147,  78,  85],
        [ 77,  80,  89],
        [  1,   0, 118],
        [  0, 123,  39]], dtype=uint8)
file2 =
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]], dtype=uint8)

This puzzled me, but eventually I figured out that I needed to set the offset parameter in np.memmap to 128 to make it work:

for i in range(0, 1000):
    file1 = np.memmap(mem_fpath, dtype=data.dtype.name, shape=data.shape,
                      offset=i, mode='r')
    if np.all(file1 == data):
        print('i = {!r}'.format(i))
        break

print('file1 =\n{!r}'.format(file1[0]))

results in the expected

i = 128
file1 =
memmap([[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2],
        [3, 3, 3]], dtype=uint8)

My question is: where does this 128 come from? I checked the np.save docs and don't see any reference to it. I also tried modifying the dtype and shape of the data, but the offset was always 128. Can I assume that any single array saved with np.save will always have this 128-byte offset? If not, how can I determine what the offset should be?

The reason I ask is that, for my particular use case of cropping small regions out of much larger files on disk, I've found np.memmap to be much faster than np.load.
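For context, the access pattern looks roughly like this (a sketch only; the file name, shape, and crop bounds below are made up):

# Sketch of the cropping use case; 'big.npy', its shape, and the
# crop bounds are hypothetical placeholders.
big = np.memmap('big.npy', dtype=np.uint8, shape=(20000, 20000, 3),
                offset=128, mode='r')
crop = np.array(big[1000:1100, 2000:2100])  # copies only the touched region
# np.load (without mmap_mode) would read the entire file instead.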

Thank you for any help!

The 128-byte offset you're seeing should be considered a fluke of the implementation. The length of an NPY file header is required to be a multiple of 16, and the implementation currently aligns to 64 bytes because 16 isn't enough for memory-mapping on all platforms.
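You can see this concretely by inspecting the first few bytes of the file written above. The sketch below assumes an NPY version 1.x file, where the header length is stored as a little-endian uint16 immediately after the magic string and version bytes:

import struct

with open(mem_fpath, 'rb') as f:
    magic = f.read(6)           # b'\x93NUMPY'
    major, minor = f.read(2)    # format version, e.g. 1, 0
    (header_len,) = struct.unpack('<H', f.read(2))  # little-endian uint16
    print(magic, major, minor, header_len)
    # The data starts at 6 + 2 + 2 + header_len bytes;
    # for this file that is 10 + 118 = 128.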

128 bytes will be a very common header size, since boilerplate in the header takes about 64 bytes, and most arrays don't have a format complex enough to require more than 128 bytes of header to describe them. However, structured arrays can easily result in a header longer than 128 bytes, and NPY files produced by an older NumPy version or a different implementation of the format could have a different alignment.
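So rather than hard-coding 128, compute the offset from the header itself. Here is a sketch using the public helpers in numpy.lib.format; it only handles format versions 1.0 and 2.0, and npy_data_offset is a hypothetical helper name:

import numpy as np
from numpy.lib import format as npy_format

def npy_data_offset(fpath):
    """Return (offset, dtype, shape) for the raw data in an NPY file."""
    with open(fpath, 'rb') as f:
        version = npy_format.read_magic(f)
        if version == (1, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
        elif version == (2, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(f)
        else:
            raise ValueError('unhandled NPY format version %r' % (version,))
        # After the header is read, the file position sits exactly at the
        # start of the array data. (fortran_order is ignored here; this
        # sketch assumes C-ordered arrays, as np.save writes by default.)
        return f.tell(), dtype, shape

offset, dtype, shape = npy_data_offset(mem_fpath)
file1 = np.memmap(mem_fpath, dtype=dtype, shape=shape, offset=offset, mode='r')

Alternatively, np.load(mem_fpath, mmap_mode='r') and numpy.lib.format.open_memmap(mem_fpath, mode='r') both return a memmap with the offset worked out for you.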
