
Space-efficient data store for a list of lists of lists. Elements are integers, and all lists vary in length

Say my data looks like this:

thisList = [
    [[13, 43, 21, 4], [33, 2, 111, 33332, 23, 43, 2, 2], [232, 2], [23, 11]],
    [[21, 2233, 2], [2, 3, 2, 1, 32, 22], [3]],
    [[3]],
    [[23, 12], [55, 3]],
    ...
]

What is the most space-efficient way to store this type of data?

I looked at NumPy files, but NumPy only supports uniform-length data.

I looked at HDF5, which has support for 1-D ragged tensors, but not 2-D:

https://stackoverflow.com/a/42659049/3259896

So there's the option of creating a separate HDF5 file for every list in thisList , but I would potentially have 10-20 million of those lists.
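For context, h5py does handle one level of raggedness with a variable-length dtype, something like this sketch (the filename is illustrative):

import h5py
import numpy as np

# One ragged dimension works: a variable-length integer dtype.
dt = h5py.special_dtype(vlen=np.dtype("int32"))
with h5py.File("ragged_1d.h5", mode="w") as f:
    ds = f.create_dataset("vlen", shape=(3,), dtype=dt)
    ds[0] = [13, 43, 21, 4]
    ds[1] = [33, 2, 111, 33332, 23, 43, 2, 2]
    ds[2] = [232, 2]
# There is no vlen-of-vlen dtype, though, so a list of lists of
# lists cannot be stored as a single dataset this way.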

I ran benchmarks saving a ragged nested list with JSON, BSON, NumPy, and HDF5.

TL;DR: use compressed JSON, because it is the most space-efficient and the easiest to encode and decode.

On the synthetic data described below, here are the resulting file sizes (from du -sh test* ):

4.5M    test.json.gz
7.5M    test.bson.gz
8.5M    test.npz
261M    test_notcompressed.h5
1.3G    test_compressed.h5

Compressed JSON is the most efficient in terms of storage, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping. BSON comes in second, but it has to be converted to a mapping, which complicates encoding and decoding (and negates BSON's encoding/decoding speed advantage over JSON). NumPy's compressed NPZ format is third best, but like BSON, the ragged list must be made into a dictionary before saving. HDF5 is surprisingly large, especially when compressed. This is probably because every innermost list becomes its own dataset, and compression adds overhead to each dataset.
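For illustration, decoding the compressed JSON is a simple round trip. A minimal sketch of the reader (not part of the benchmark below; it just mirrors save_json_gz):

def load_json_gz(filepath):
    """Inverse of save_json_gz: gunzip, then parse the JSON."""
    import gzip
    import json

    with gzip.GzipFile(filepath, mode="r") as f:
        return json.loads(f.read().decode())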


Benchmarks

Here is the relevant benchmarking code. The bson package is the one that ships with pymongo . I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.

def get_ragged_list(length=100000):
    """Return a ragged nested list of random integers."""
    import random

    random.seed(42)
    ragged = []
    for _ in range(length):
        # Each item holds 1-9 sublists; each sublist holds 1-9
        # random integers in [0, 1000].
        n_sublists = random.randint(1, 9)
        sublist = []
        for _ in range(n_sublists):
            subsublist = [random.randint(0, 1000) for _ in range(random.randint(1, 9))]
            sublist.append(subsublist)
        ragged.append(sublist)
    return ragged

def save_json_gz(obj, filepath):
    import gzip
    import json

    json_str = json.dumps(obj)
    json_bytes = json_str.encode()
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(json_bytes)

def save_bson(obj, filepath):
    import gzip
    import bson

    # BSON needs a mapping at the top level, so key each
    # innermost list by its "outer/inner" indices.
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    b = bson.BSON.encode(d)
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(b)

def save_numpy(obj, filepath):
    import numpy as np

    # np.savez stores named arrays, so key each innermost list by
    # its "outer/inner" indices and unpack the dict as keywords.
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    np.savez_compressed(filepath, **d)

def save_hdf5(obj, filepath, compression="lzf"):
    import h5py

    with h5py.File(filepath, mode="w") as f:
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                name = f"{ii}/{jj}"
                f.create_dataset(name, data=nn, compression=compression)

ragged = get_ragged_list()

save_json_gz(ragged, "test.json.gz")
save_bson(ragged, "test.bson.gz")
save_numpy(ragged, "test.npz")
save_hdf5(ragged, "test_notcompressed.h5", compression=None)
save_hdf5(ragged, "test_compressed.h5", compression="lzf")
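To show the extra decoding step the keyed formats need, here is a sketch (not part of the benchmark, and assuming the "ii/jj" naming scheme above) that rebuilds the nested list from the HDF5 files; the NPZ and BSON mappings invert analogously:

def load_hdf5(filepath):
    """Rebuild the ragged nested list from the "ii/jj" dataset layout."""
    import h5py

    with h5py.File(filepath, mode="r") as f:
        # Root groups are named "0", "1", ...; each holds datasets
        # "0", "1", ... for the innermost lists. Sort numerically.
        return [
            [f[ii][jj][()].tolist() for jj in sorted(f[ii], key=int)]
            for ii in sorted(f, key=int)
        ]

assert load_hdf5("test_notcompressed.h5") == ragged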

Versions of relevant packages:

python 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 18:16:37) [GCC 7.3.0]
pymongo bson 3.10.1
numpy 1.18.2
h5py 2.10.0
