Performance issue with loop on datasets with h5py

I want to apply a simple function to the datasets contained in an hdf5 file. I am using code similar to this:

import h5py
data_sums = []

with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        print name
        # data_sums.append(data.sum(1))
        data[()]  # My goal is similar to the line above but this line is enough
                  # to replicate the problem

It goes very fast at the beginning and, after a certain number of datasets (reproducible to some extent), it slows down dramatically. If I comment out the last line, it finishes almost instantly. It does not matter whether the data are stored (here, appended to a list) or not: something like data[:100] has a similar effect. The number of datasets that can be processed before the drop in performance depends on the size of the portion accessed at each iteration. Iterating over smaller chunks does not solve the issue.
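For concreteness, the variants described above look roughly like this (a sketch only; the chunk size of 100 rows is illustrative, and input_file and "group" reuse the names from the snippet above):

import h5py

with h5py.File(input_file, "r") as f:
    for (name, data) in f["group"].iteritems():
        # reading only a slice instead of the whole dataset shows the same slowdown
        partial = data[:100]

        # iterating over the dataset in smaller chunks does not help either
        for start in xrange(0, data.shape[0], 100):
            chunk = data[start:start + 100]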

I suppose I am filling up some memory space and that the process slows down when it is full, but I do not understand why.

How can I circumvent this performance issue?

I am running Python 2.6.5 on Ubuntu 10.04.

Edit: The following code does not slow down if the second line of the loop is uncommented. It does slow down without it.

import h5py
import numpy as np

f = h5py.File(input_file, "r")   # input_file: path to the hdf5 file
list_name = f["data"].keys()
f.close()

for name in list_name:
    f = h5py.File(input_file, "r")
    # name = list_name[0]  # with this line the issue vanishes.
    data = f["data"][name]
    tag = get_tag(name)          # get_tag is a helper defined elsewhere
    data[:, 1].sum()
    print "."
    f.close()

Edit: I found out that accessing the first dimension of multidimensional datasets seems to run without issues; the problem occurs when higher dimensions are involved.
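A minimal sketch of a workaround that follows from this observation, assuming each dataset fits in memory and reusing the names from the first snippet: read each dataset into a NumPy array with a single contiguous read, then do the axis-1 sum in NumPy rather than slicing higher dimensions through h5py:

import h5py
import numpy as np

data_sums = []

with h5py.File(input_file, "r") as f:
    for (name, dset) in f["group"].iteritems():
        arr = np.empty(dset.shape, dtype=dset.dtype)
        dset.read_direct(arr)              # one contiguous read of the whole dataset
        # (arr = dset[...] is a simpler equivalent)
        data_sums.append(arr.sum(axis=1))  # the per-row sum is now done purely in NumPy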

Platform?

On 64-bit Windows with Python 2.6.6, I have seen some weird issues when crossing a 2 GB barrier (I think) if you have allocated the memory in small chunks.

You can see it with a script like this:

ix = []
for i in xrange(20000000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000)

You can see that it runs pretty fast for a while and then suddenly slows down.

But if you run it in larger blocks:

ix = []
for i in xrange(20000):
    if i % 1000 == 0:   # progress interval scaled down so it still prints within 20000 iterations
        print i
    ix.append('*' * 1000000)

it doesn't seem to have the problem (though it will run out of memory, depending on how much you have - 8 GB here).

Weirder yet, if you eat up the memory using large blocks, then clear the memory (ix = [] again, so back to almost no memory in use), and then re-run the small-block test, it isn't slow anymore.
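Spelled out, that sequence is roughly this (a sketch that just chains the two snippets above):

# 1. allocate memory in large blocks
ix = []
for i in xrange(20000):
    ix.append('*' * 1000000)

# 2. release it again
ix = []

# 3. re-run the small-block test -- it no longer slows down at the same point
for i in xrange(20000000):
    if i % 100000 == 0:
        print i
    ix.append('*' * 1000)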

I think there was some dependence on the pyreadline version; 2.0-dev1 helped a lot with these sorts of issues, but I don't remember the details. When I tried it just now, I don't really see this issue anymore: both slow down significantly around 4.8 GB, which, with everything else I have running, is about where it hits the limits of physical memory and starts swapping.
