Space-efficient data store for a list of lists of lists. Elements are integers, and all of the lists vary in length
Say my data looks like this:
thisList = [
    [[13, 43, 21, 4], [33, 2, 111, 33332, 23, 43, 2, 2], [232, 2], [23, 11]],
    [[21, 2233, 2], [2, 3, 2, 1, 32, 22], [3]],
    [[3]],
    [[23, 12], [55, 3]],
    ....
]
What is the most space-efficient way to store this type of data?
I looked at NumPy files, but NumPy only supports uniform-length data.
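For illustration, here is a minimal sketch of what happens when NumPy is handed ragged rows (my example, not from the original question):

import numpy as np

# Rows of unequal length cannot form a regular ndarray; NumPy falls back
# to a 1-D array of Python objects, so there is no real 2-D structure to save.
ragged_rows = np.array([[13, 43, 21, 4], [232, 2]], dtype=object)
print(ragged_rows.shape)  # (2,)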
I looked at HDF5, which has support for 1-D ragged tensors, but not 2-D: https://stackoverflow.com/a/42659049/3259896
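For reference, a minimal sketch of that 1-D case using h5py's vlen_dtype (my example, not from the original question); h5py does not accept a variable-length base type that is itself variable-length, so a second ragged dimension is not possible:

import h5py
import numpy as np

# One level of raggedness works: each row is a variable-length int array.
dt = h5py.vlen_dtype(np.dtype("int32"))
with h5py.File("ragged_1d.h5", mode="w") as f:
    ds = f.create_dataset("rows", shape=(3,), dtype=dt)
    ds[0] = [13, 43, 21, 4]
    ds[1] = [232, 2]
    ds[2] = [23, 11, 55]
# There is no vlen-of-vlen: the base dtype of a vlen type must be
# fixed-size, so 2-D ragged data cannot live in a single dataset.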
So there's the option of creating a separate HDF5 file for every list in thisList, but I would potentially have 10-20 million of those lists.
I ran benchmarks saving a ragged nested list with JSON, BSON, NumPy, and HDF5.
TL;DR: use compressed JSON, because it is the most space-efficient and the easiest to encode/decode.
On the synthetic data, here are the results (from du -sh test*):
4.5M test.json.gz
7.5M test.bson.gz
8.5M test.npz
261M test_notcompressed.h5
1.3G test_compressed.h5
Compressed JSON is the most efficient in terms of storage, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping. BSON comes in second, but it has to be converted to a mapping, which complicates encoding and decoding (and negates the encoding/decoding speed benefits of BSON over JSON). NumPy's compressed NPZ format is third best, but like BSON, the ragged list must be made into a dictionary before saving. HDF5 is surprisingly large, especially compressed. This is probably because there are many different datasets, and compression adds overhead to each dataset.
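To illustrate that extra work, here is a minimal sketch (mine, not part of the benchmark) of rebuilding the ragged list from the flat "{i}/{j}" mapping that the BSON and NPZ writers below produce:

def mapping_to_ragged(d):
    """Rebuild a nested list from a flat {"i/j": sublist} mapping."""
    outer = {}
    for key, sublist in d.items():
        ii, jj = (int(part) for part in key.split("/"))
        outer.setdefault(ii, {})[jj] = list(sublist)
    return [[outer[ii][jj] for jj in sorted(outer[ii])] for ii in sorted(outer)]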
Here is the relevant code for the benchmarking. The bson package is part of pymongo. I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.
def get_ragged_list(length=100000):
    """Return ragged nested list."""
    import random

    random.seed(42)
    l = []
    for _ in range(length):
        n_sublists = random.randint(1, 9)
        sublist = []
        for _ in range(n_sublists):
            subsublist = [random.randint(0, 1000) for _ in range(random.randint(1, 9))]
            sublist.append(subsublist)
        l.append(sublist)
    return l
def save_json_gz(obj, filepath):
    import gzip
    import json

    json_str = json.dumps(obj)
    json_bytes = json_str.encode()
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(json_bytes)
def save_bson(obj, filepath):
    import gzip
    import bson

    # A BSON document must be a mapping, so flatten the ragged list
    # into a dict keyed by "i/j".
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    b = bson.BSON.encode(d)
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(b)
def save_numpy(obj, filepath):
    import numpy as np

    # Like BSON, flatten the ragged list into a dict keyed by "i/j".
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    # Passing the dict positionally stores it as a single pickled object
    # array ("arr_0"); read it back with np.load(..., allow_pickle=True).
    np.savez_compressed(filepath, d)
def save_hdf5(obj, filepath, compression="lzf"):
    import h5py

    # One dataset per innermost list, named "i/j".
    with h5py.File(filepath, mode="w") as f:
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                name = f"{ii}/{jj}"
                f.create_dataset(name, data=nn, compression=compression)
ragged = get_ragged_list()
save_json_gz(ragged, "test.json.gz")
save_bson(ragged, "test.bson.gz")
save_numpy(ragged, "test.npz")
save_hdf5(ragged, "test_notcompressed.h5", compression=None)
save_hdf5(ragged, "test_compressed.h5", compression="lzf")
Versions of relevant packages:
python 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 18:16:37) [GCC 7.3.0]
pymongo bson 3.10.1
numpy 1.18.2
h5py 2.10.0