Space-efficient data store for a list of lists of lists. Elements are integers, and all of the lists vary in length
Say my data looks like this:
thisList = [
    [[13, 43, 21, 4], [33, 2, 111, 33332, 23, 43, 2, 2], [232, 2], [23, 11]],
    [[21, 2233, 2], [2, 3, 2, 1, 32, 22], [3]],
    [[3]],
    [[23, 12], [55, 3]],
    ....
]
What is the most space-efficient way to store this type of data?
I looked at NumPy files, but NumPy only supports uniform-length data.
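For illustration, here is a minimal sketch of what happens when NumPy is handed ragged rows (my example, not from the original question):

import numpy as np

# Rows of unequal length cannot form a regular ndarray; NumPy falls back
# to a 1-D array of Python objects, so there is no real 2-D structure to save.
ragged_rows = np.array([[13, 43, 21, 4], [232, 2]], dtype=object)
print(ragged_rows.shape)  # (2,)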
I looked at HDF5, which has support for 1-D ragged tensors, but not 2-D: https://stackoverflow.com/a/42659049/3259896
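For reference, a minimal sketch of that 1-D case using h5py's vlen_dtype (my example, not from the original question); h5py does not accept a variable-length base type that is itself variable-length, so a second ragged dimension is not possible:

import h5py
import numpy as np

# One level of raggedness works: each row is a variable-length int array.
dt = h5py.vlen_dtype(np.dtype("int32"))
with h5py.File("ragged_1d.h5", mode="w") as f:
    ds = f.create_dataset("rows", shape=(3,), dtype=dt)
    ds[0] = [13, 43, 21, 4]
    ds[1] = [232, 2]
    ds[2] = [23, 11, 55]
# There is no vlen-of-vlen: the base dtype of a vlen type must be
# fixed-size, so 2-D ragged data cannot live in a single dataset.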
So there's the option of creating a separate HDF5 file for every list in thisList, but I would potentially have 10-20 million of those lists.
I ran benchmarks saving a ragged nested list with JSON, BSON, NumPy, and HDF5.
TL;DR: use compressed JSON, because it is the most space-efficient and the easiest to encode/decode.
On the synthetic data, here are the results (from du -sh test*):
4.5M test.json.gz
7.5M test.bson.gz
8.5M test.npz
261M test_notcompressed.h5
1.3G test_compressed.h5
Compressed JSON is the most efficient in terms of storage, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping. BSON comes in second, but it has to be converted to a mapping, which complicates encoding and decoding (and negates the encoding/decoding speed benefits of BSON over JSON). NumPy's compressed NPZ format is third best, but like BSON, the ragged list must be made into a dictionary before saving. HDF5 is surprisingly large, especially compressed. This is probably because there are many different datasets, and compression adds overhead to each dataset.
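To illustrate that extra work, here is a minimal sketch (mine, not part of the benchmark) of rebuilding the ragged list from the flat "{i}/{j}" mapping that the BSON and NPZ writers below produce:

def mapping_to_ragged(d):
    """Rebuild a nested list from a flat {"i/j": sublist} mapping."""
    outer = {}
    for key, sublist in d.items():
        ii, jj = (int(part) for part in key.split("/"))
        outer.setdefault(ii, {})[jj] = list(sublist)
    return [[outer[ii][jj] for jj in sorted(outer[ii])] for ii in sorted(outer)]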
Here is the relevant code for the benchmarking. The bson package is part of pymongo. I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.
def get_ragged_list(length=100000):
    """Return ragged nested list."""
    import random

    random.seed(42)
    l = []
    for _ in range(length):
        n_sublists = random.randint(1, 9)
        sublist = []
        for _ in range(n_sublists):
            subsublist = [random.randint(0, 1000) for _ in range(random.randint(1, 9))]
            sublist.append(subsublist)
        l.append(sublist)
    return l
def save_json_gz(obj, filepath):
    import gzip
    import json

    json_str = json.dumps(obj)
    json_bytes = json_str.encode()
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(json_bytes)
def save_bson(obj, filepath):
    import gzip
    import bson

    # A BSON document must be a mapping, so flatten the ragged list
    # into a dict keyed by "i/j".
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    b = bson.BSON.encode(d)
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(b)
def save_numpy(obj, filepath):
    import numpy as np

    # Like BSON, flatten the ragged list into a dict keyed by "i/j".
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    # Passing the dict positionally stores it as a single pickled object
    # array ("arr_0"); read it back with np.load(..., allow_pickle=True).
    np.savez_compressed(filepath, d)
def save_hdf5(obj, filepath, compression="lzf"):
    import h5py

    # One dataset per innermost list, named "i/j".
    with h5py.File(filepath, mode="w") as f:
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                name = f"{ii}/{jj}"
                f.create_dataset(name, data=nn, compression=compression)
ragged = get_ragged_list()
save_json_gz(ragged, "test.json.gz")
save_bson(ragged, "test.bson.gz")
save_numpy(ragged, "test.npz")
save_hdf5(ragged, "test_notcompressed.h5", compression=None)
save_hdf5(ragged, "test_compressed.h5", compression="lzf")
Versions of relevant packages:
python 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 18:16:37) [GCC 7.3.0]
pymongo bson 3.10.1
numpy 1.18.2
h5py 2.10.0