Can a numpy file be created without defining an array?

I have some very large data to deal with. I'd like to be able to use np.load(filename, mmap_mode="r+") to use these files on disk rather than RAM. My issue is that creating them in RAM causes the exact problem I'm trying to avoid.

I know about np.memmap already and that is a potential solution, but creating a memmap and then saving the array using np.save(filename, memmap) means that I'd be doubling the disk space requirement even if only briefly and that isn't always an option. Primarily I don't want to use memmaps because the header information in .npy files (namely shape and dtype) is useful to have.

My question is: can I create a numpy file without first creating the array in memory? That is, can I create a numpy file by just giving a dtype and a shape? The idea would be along the lines of np.save(filename, np.empty((x, y, z))), but I'm assuming that np.empty allocates the whole array in memory before it can be saved.

My current solution is:

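import tempfile
import numpy as np

def create_empty_numpy_file(filename, shape, dtype=np.float64):
    # Back the array with a temporary file so it never has to sit in RAM.
    with tempfile.TemporaryFile() as tmp:
        memmap = np.memmap(tmp, dtype, mode="w+", shape=shape)
        np.save(filename, memmap)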

EDIT

My final solution based on bnaeker's answer and a few details from numpy.lib.format:

import numpy as np


class MockFlags:
    def __init__(self, shape, c_contiguous=True):
        self.c_contiguous = c_contiguous
        self.f_contiguous = (not c_contiguous) or (c_contiguous and len(shape) == 1)


class MockArray:
    def __init__(self, shape, dtype=np.float64, c_contiguous=True):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.flags = MockFlags(shape, c_contiguous)

    def save(self, filename):
        # Total payload size in bytes, as a plain Python integer.
        # (np.product was removed in NumPy 2.0; np.prod is the replacement.)
        total_bytes = int(np.prod(self.shape, dtype=np.int64)) * self.dtype.itemsize

        # Write zeros in 16 MiB chunks to hide the Python loop overhead.
        buffersize = 16 * 1024 ** 2
        n_chunks, remainder = divmod(total_bytes, buffersize)

        with open(filename, "wb") as f:
            np.lib.format.write_array_header_2_0(
                f, np.lib.format.header_data_from_array_1_0(self)
            )

            zeros = b"\x00" * buffersize
            for _ in range(n_chunks):
                f.write(zeros)
            f.write(b"\x00" * remainder)
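A quick round-trip check of this idea might look like the sketch below. It is not the code above verbatim: the helper name create_empty_npy is made up for illustration, types.SimpleNamespace stands in for the mock classes, and the file is extended with truncate() instead of a write loop, on the assumption that the platform zero-fills the new region (most do, but Python does not guarantee it).

```python
import os
import tempfile
from types import SimpleNamespace

import numpy as np


def create_empty_npy(filename, shape, dtype=np.float64):
    """Write a .npy header for (shape, dtype), then extend the file."""
    dtype = np.dtype(dtype)
    # Stand-in exposing only the attributes header_data_from_array_1_0 reads.
    mock = SimpleNamespace(
        shape=tuple(shape),
        dtype=dtype,
        flags=SimpleNamespace(c_contiguous=True, f_contiguous=False),
    )
    n_items = 1
    for dim in shape:
        n_items *= dim
    with open(filename, "wb") as f:
        np.lib.format.write_array_header_2_0(
            f, np.lib.format.header_data_from_array_1_0(mock)
        )
        # Assumption: extending via truncate() yields zero bytes here.
        f.truncate(f.tell() + n_items * dtype.itemsize)


path = os.path.join(tempfile.mkdtemp(), "empty.npy")
create_empty_npy(path, (10, 3, 4), np.float32)

# Reopen as a memory map; shape and dtype come from the header.
arr = np.load(path, mmap_mode="r+")
print(arr.shape, arr.dtype)
arr[0, 0, 0] = 1.5  # writes go to the file, not to a copy in RAM
arr.flush()
del arr
```

If the zero-fill assumption is a concern, the explicit write loop is the safer choice; truncate() is only attractive because many filesystems can create the file sparsely.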

The NumPy file format is really simple. There are a few under-documented functions you can use to create the required header bytes from the metadata needed to build an array, without actually building one.
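To see how simple the layout is, here is a small sketch that saves a tiny array into memory and splits the result into its parts. The variable names are mine, and the 2-byte versus 4-byte length field is keyed off the major version as I understand the format description.

```python
import io

import numpy as np

# Save a small array into an in-memory buffer instead of a file.
stream = io.BytesIO()
np.save(stream, np.zeros((2, 3)))
raw = stream.getvalue()

magic = raw[:6]                      # magic string, b"\x93NUMPY"
major, minor = raw[6], raw[7]        # format version, one byte each
size_bytes = 2 if major == 1 else 4  # width of the header-length field
header_len = int.from_bytes(raw[8:8 + size_bytes], "little")
data_start = 8 + size_bytes + header_len
header_text = raw[8 + size_bytes:data_start].decode("latin1")
payload = raw[data_start:]           # raw array bytes, nothing else

print(magic, (major, minor), header_len)
print(repr(header_text))
print(len(payload))  # 2 * 3 float64 values at 8 bytes each
```

The function below builds those same pieces by hand.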

import numpy as np

def create_npy_header_bytes(
    shape, dtype=np.float64, fortran_order=False, format_version="2.0"
):
    # 4 or 2-byte unsigned integer, depending on version
    n_size_bytes = 4 if format_version[0] == "2" else 2
    magic = b"\x93NUMPY"
    version_info = (
        int(each).to_bytes(1, "little") for each in format_version.split(".")
    )

    # Keys are supposed to be alphabetically sorted
    header = {
        "descr": np.lib.format.dtype_to_descr(np.dtype(dtype)),
        "fortran_order": fortran_order,
        "shape": shape
    }

    # Pad the header so that magic + version + length field + header
    # (including the trailing newline) is a multiple of 64 bytes.
    header_bytes = str(header).encode("ascii")
    current_length = len(magic) + 2 + n_size_bytes + len(header_bytes) + 1
    padding = -current_length % 64
    header_bytes += b" " * padding + b"\n"

    # Length of the header dict, including padding and newline
    length = len(header_bytes).to_bytes(n_size_bytes, "little")

    return b"".join((magic, *version_info, length, header_bytes))

You can test that it's equivalent with this snippet:

import numpy as np
import io
x = np.zeros((10, 3, 4))

first = create_npy_header_bytes(x.shape)
stream = io.BytesIO()
np.lib.format.write_array_header_2_0(
    stream, np.lib.format.header_data_from_array_1_0(x)
)
print(f"Library: {stream.getvalue()}")
print(f"Custom: {first}")

You should see something like:

Library: b"\x93NUMPY\x02\x00t\x00\x00\x00{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4), }                                                    \n"
Custom: b"\x93NUMPY\x02\x00t\x00\x00\x00{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4)}                                                      \n"

These match except for the trailing comma inside the header dict. That does not matter: the header only has to be a valid Python literal for a dict, and a trailing comma is allowed there.
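One way to convince yourself is to feed a hand-built header without the trailing comma back through NumPy's own reader functions. This is only a sketch: the bytes are assembled inline rather than via the function above, and it assumes read_magic and read_array_header_2_0 behave as in the NumPy versions I have used.

```python
import ast
import io

import numpy as np

with_comma = "{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4), }"
without_comma = "{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4)}"

# Both strings evaluate to the same dict as Python literals.
same = ast.literal_eval(with_comma) == ast.literal_eval(without_comma)
print(same)

# Build a version 2.0 header by hand from the comma-free form.
prefix_len = 6 + 2 + 4  # magic + version + 4-byte length field
padding = -(prefix_len + len(without_comma) + 1) % 64
header = (without_comma + " " * padding + "\n").encode("ascii")
raw = b"\x93NUMPY" + bytes([2, 0]) + len(header).to_bytes(4, "little") + header

# Let NumPy parse it back.
stream = io.BytesIO(raw)
version = np.lib.format.read_magic(stream)
shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(stream)
print(version, shape, fortran_order, dtype)
```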


As an alternative approach, you could mock out an object that has the attributes the library functions need to build the header. For np.lib.format.header_data_from_array_1_0, these seem to be .flags (which must have a field c_contiguous and/or f_contiguous), .shape, and .dtype. That's actually much simpler, and would look like:

import numpy as np
import io

class MockFlags:
    def __init__(self, shape, c_contiguous=True):
        self.c_contiguous = c_contiguous
        self.f_contiguous = (not c_contiguous) or (c_contiguous and len(shape) == 1)

class MockArray:
    def __init__(self, shape, dtype=np.float64, c_contiguous=True):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.flags = MockFlags(shape, c_contiguous)

mock = MockArray((10, 3, 4))
stream = io.BytesIO()
np.lib.format.write_array_header_2_0(
    stream, np.lib.format.header_data_from_array_1_0(mock)
)
print(stream.getvalue())

You should see:

b"\x93NUMPY\x02\x00t\x00\x00\x00{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4), }                                                    \n"

which happily matches what we have above, but without the tedious work of counting bytes, fiddling with padding, etc. Much better :)
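It may also be worth trying np.lib.format.open_memmap, which, as far as I can tell from the NumPy docs, creates the .npy file (header included) and hands back a memmap onto its data region when called with mode="w+". The sketch below uses a small shape and a temporary path purely for illustration.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.npy")

# mode="w+" creates the file, writes the header, and maps the data region.
arr = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float64, shape=(10, 3, 4)
)
arr[0] = 1.0  # fill one 3x4 slice; the rest stays as created
arr.flush()
del arr

# The result is an ordinary .npy file, so np.load can map it again.
loaded = np.load(path, mmap_mode="r+")
print(loaded.shape, loaded.dtype)
```

If that works for your NumPy version, it avoids both the temporary copy and the hand-written header.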
