
H5PY key reads slow

I've created a dataset with 1000 groups, each containing 1300 uint8 arrays of varying lengths (though each individual array has a fixed size). Keys are strings of ~10 characters. I'm not trying to do anything tricky while saving (no chunking, compression etc. - the data is already compressed).

Iterating over all keys is extremely slow the first time I run a script, though it speeds up significantly the second time (same script, different process called later), so I suspect there is some caching involved somehow. After a while performance resets to the terrible level until I've waited it out again.

Is there a way to store the data that alleviates this problem? Or can I read it differently somehow?

Simplified code to save:

with h5py.File('my_dataset.hdf5', 'w') as fp:
    for k0 in keys0:
        group = fp.create_group(k0)
        for k1, v1 in get_items(k0):
            group.create_dataset(k1, data=np.array(v1, dtype=np.uint8))

Simplified key accessing code:

n = 0
with h5py.File('my_dataset.hdf5', 'r') as fp:
    for k0 in fp.keys():
        group = fp[k0]
        n += len(tuple(group.keys()))

If I track the progress of this script during a 'slow phase', it takes almost a second per iteration. However, if I kill it after, say, 100 steps, then the next time I run the script the first 100 steps take < 1 sec in total, after which performance drops back to a crawl.

While I'm still unsure why this is slow, I've found a workaround: merge each sub-group into a single dataset.

with h5py.File('my_dataset.hdf5', 'w') as fp:
    for k0 in keys0:
        subkeys = get_subkeys(k0)
        nk = len(subkeys)
        # one group per top-level key, holding two datasets
        # (without the group, the name 'data' would collide on the
        # second iteration)
        group = fp.create_group(k0)
        data = group.create_dataset(
            'data', shape=(nk,),
            dtype=h5py.special_dtype(vlen=np.dtype(np.uint8)))
        keys = group.create_dataset('keys', shape=(nk,), dtype='S32')
        for i, (k1, v1) in enumerate(get_items(k0)):
            keys[i] = k1
            data[i] = v1
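With this layout, reading a group back needs only two dataset reads instead of one object open per sub-key. The following is a minimal self-contained sketch of both sides (writer and reader); the helper names `write_merged`/`read_merged` and the sample data are illustrative, not from the original post:

```python
import numpy as np
import h5py

def write_merged(path, groups):
    """Write {group_name: [(key, uint8-array), ...]} in the merged layout:
    one 'keys' dataset and one variable-length 'data' dataset per group."""
    with h5py.File(path, 'w') as fp:
        vlen_u8 = h5py.special_dtype(vlen=np.dtype(np.uint8))
        for k0, items in groups.items():
            g = fp.create_group(k0)
            data = g.create_dataset('data', shape=(len(items),), dtype=vlen_u8)
            keys = g.create_dataset('keys', shape=(len(items),), dtype='S32')
            for i, (k1, v1) in enumerate(items):
                keys[i] = k1.encode()              # padded to S32
                data[i] = np.asarray(v1, dtype=np.uint8)

def read_merged(path):
    """Return {group_name: {key: uint8-array}} using two bulk reads
    per group rather than ~1300 individual dataset opens."""
    out = {}
    with h5py.File(path, 'r') as fp:
        for k0, g in fp.items():
            keys = [k.decode() for k in g['keys'][:]]  # one read
            out[k0] = dict(zip(keys, g['data'][:]))    # one read
    return out
```

The bulk slicing (`g['keys'][:]`, `g['data'][:]`) is what avoids the per-key metadata lookups that made the original layout slow.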
