
Space-efficient data store for a list of lists of lists. Elements are integers, and all of the lists vary in length

Say my data looks like this:

thisList = [
     [[13, 43, 21, 4], [33, 2, 111, 33332, 23, 43, 2, 2], [232, 2], [23, 11]] ,
     [[21, 2233, 2], [2, 3, 2, 1, 32, 22], [3]], 
     [[3]], 
     [[23, 12], [55, 3]],
     ....
]

What is the most space-efficient way to store this type of data?

I looked at NumPy files, but NumPy only supports uniform-length data.

I looked at HDF5, which has support for 1-D ragged tensors, but not 2-D:

https://stackoverflow.com/a/42659049/3259896

So there is the option of creating a separate HDF5 file for every list in thisList, but I would potentially have 10-20 million of those lists.
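One way to sidestep the one-file-per-list problem is to flatten the whole ragged structure into uniform arrays plus offset indices, which a single NPZ or HDF5 file can hold. This is a minimal sketch of that idea; `flatten_ragged` and `unflatten_ragged` are hypothetical helpers written for illustration, not part of any library:

```python
import numpy as np

def flatten_ragged(lists):
    """Flatten a list of lists of lists into one flat value array
    plus two offset arrays (hypothetical helper, for illustration)."""
    values, inner_offsets, outer_offsets = [], [0], [0]
    for sublists in lists:
        for sub in sublists:
            values.extend(sub)  # append this innermost list's integers
            inner_offsets.append(len(values))  # where the next inner list starts
        outer_offsets.append(len(inner_offsets) - 1)  # where the next outer list starts
    return (np.asarray(values), np.asarray(inner_offsets), np.asarray(outer_offsets))

def unflatten_ragged(values, inner_offsets, outer_offsets):
    """Invert flatten_ragged: rebuild the nested Python lists."""
    inner = [values[inner_offsets[i]:inner_offsets[i + 1]].tolist()
             for i in range(len(inner_offsets) - 1)]
    return [inner[outer_offsets[i]:outer_offsets[i + 1]]
            for i in range(len(outer_offsets) - 1)]

data = [[[13, 43, 21, 4], [33, 2]], [[3]], [[23, 12], [55, 3]]]
values, inner, outer = flatten_ragged(data)
assert unflatten_ragged(values, inner, outer) == data
```

The three uniform arrays can then be stored together (e.g. with np.savez_compressed, or as three datasets in one HDF5 file) instead of one dataset or file per list.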

I ran benchmarks saving a ragged nested list with JSON, BSON, NumPy, and HDF5.

TL;DR: use compressed JSON, because it is the most space-efficient and the easiest to encode/decode.

On the synthetic data, here are the results (from du -sh test*):

4.5M    test.json.gz
7.5M    test.bson.gz
8.5M    test.npz
261M    test_notcompressed.h5
1.3G    test_compressed.h5

Compressed JSON is the most efficient in terms of storage, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping. BSON comes in second, but it has to be converted to a mapping, which complicates encoding and decoding (and negates the encoding/decoding speed benefits of BSON over JSON). NumPy's compressed NPZ format is third best, but like BSON, the ragged list must be turned into a dictionary before saving. HDF5 is surprisingly large, especially when compressed. This is probably because there are many different datasets, and compression adds overhead to each dataset.


Benchmarks

Here is the relevant code for the benchmarking. The bson package is part of pymongo. I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.

def get_ragged_list(length=100000):
    """Return ragged nested list."""
    import random

    random.seed(42)
    l = []
    for _ in range(length):
        n_sublists = random.randint(1, 9)
        sublist = []
        for i in range(n_sublists):
            subsublist = [random.randint(0, 1000) for _ in range(random.randint(1, 9))]
            sublist.append(subsublist)
        l.append(sublist)
    return l

def save_json_gz(obj, filepath):
    import gzip
    import json

    json_str = json.dumps(obj)
    json_bytes = json_str.encode()
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(json_bytes)

def save_bson(obj, filepath):
    import gzip
    import bson

    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    b = bson.BSON.encode(d)
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(b)

def save_numpy(obj, filepath):
    import numpy as np

    # NPZ archives store a mapping of names to arrays, so flatten the
    # ragged list into a dict keyed by "outer/inner" index.
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    np.savez_compressed(filepath, **d)

def save_hdf5(obj, filepath, compression="lzf"):
    import h5py

    with h5py.File(filepath, mode="w") as f:
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                name = f"{ii}/{jj}"
                f.create_dataset(name, data=nn, compression=compression)
ragged = get_ragged_list()

save_json_gz(ragged, "test.json.gz")
save_bson(ragged, "test.bson.gz")
save_numpy(ragged, "test.npz")
save_hdf5(ragged, "test_notcompressed.h5", compression=None)
save_hdf5(ragged, "test_compressed.h5", compression="lzf")
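Reading the compressed JSON back simply mirrors save_json_gz above: gunzip, then json.loads. A minimal sketch (the temporary path here is only for the round-trip demonstration):

```python
import gzip
import json
import os
import tempfile

def load_json_gz(filepath):
    """Inverse of save_json_gz: decompress, then parse the JSON."""
    with gzip.GzipFile(filepath, mode="r") as f:
        return json.loads(f.read().decode())

# Round-trip check through a temporary file.
data = [[[13, 43, 21, 4], [33, 2]], [[3]]]
path = os.path.join(tempfile.mkdtemp(), "test.json.gz")
with gzip.GzipFile(path, mode="w") as f:
    f.write(json.dumps(data).encode())
assert load_json_gz(path) == data
```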

Versions of relevant packages:

python 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 18:16:37) [GCC 7.3.0]
pymongo bson 3.10.1
numpy 1.18.2
h5py 2.10.0
