
Nested Iteration of HDF5 using PyTables

I have a fairly large dataset that I store in HDF5 and access using PyTables. One operation I need to do on this dataset is a pairwise comparison between each of the elements. This requires two loops: one to iterate over each element, and an inner loop to iterate over every other element. The operation therefore performs N(N-1)/2 comparisons.

For fairly small sets I found it to be faster to dump the contents into a multidimensional numpy array and then do my iteration. With large sets I run into memory problems and need to access each element of the dataset at run time.

Putting the elements into an array gives me about 600 comparisons per second, while operating on the HDF5 data itself gives me about 300 comparisons per second.

Is there a way to speed this process up?

Example follows (this is not my real code, just an example):

Small Set:

import numpy as np
import tables as tb

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data

    N_elements = len(data)
    elements = np.empty((N_elements, int(1e5)))

    # copy every element into the in-memory array
    for ii, d in enumerate(data):
        elements[ii] = d['element']

# pairwise comparison over the upper triangle: N(N-1)/2 calls to compare()
D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii+1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])

Large Set:

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data

    N_elements = len(data)

    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            # both elements are read from the HDF5 table on every iteration
            D[ii, jj] = compare(data[ii]['element'], data[jj]['element'])

Two approaches I'd suggest here:

  1. numpy memmap: create a memory-mapped array, put the data inside it, and then run the code for the "Small Set" case. Memory maps behave almost like in-memory arrays; a minimal sketch follows after this list.

  2. Use the multiprocessing module to allow parallel processing: if the compare method consumes at least a noticeable amount of CPU time, you could use more than one process; a sketch of this pipeline also follows below.

Assuming you have more than one core in your CPU, this will speed things up significantly. Use

  • one process to read the data from the HDF5 file and put it into a queue,
  • one process to grab pairs from the queue, do the comparison, and put the results into an "output" queue, and
  • one process to collect the results again.
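A minimal sketch of approach 1, assuming the layout from the question (h5_file, compare, and element vectors of 1e5 float64 values); the file name 'elements.dat' and the variable element_len are illustrative:

import numpy as np
import tables as tb

element_len = int(1e5)  # length of one element vector, as in the question

with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)

    # disk-backed array: indexes like a numpy array, paged in from disk on demand
    elements = np.memmap('elements.dat', dtype='float64',
                         mode='w+', shape=(N_elements, element_len))
    for ii, d in enumerate(data):
        elements[ii] = d['element']
    elements.flush()  # make sure everything is written to disk

# now the "Small Set" loop runs unchanged against the memmap
D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii + 1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])

If D itself gets too large for memory, it can be a second memmap created the same way.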
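And a minimal sketch of the queue pipeline from approach 2. Everything here is illustrative: the dummy compare, the worker count, and the queue size are assumptions, and the collection step runs in the main process. Note that shipping full element vectors through a queue has pickling overhead, so this only pays off when compare is CPU-heavy:

import multiprocessing as mp
import numpy as np
import tables as tb

def compare(a, b):
    return np.abs(a - b).sum()  # stand-in for your real comparison

def reader(h5_file, tasks, n_workers):
    # read pairs of elements from the HDF5 file and feed the workers
    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data
        N = len(data)
        for ii in xrange(N):
            a = data[ii]['element']
            for jj in xrange(ii + 1, N):
                tasks.put((ii, jj, a, data[jj]['element']))
    for _ in xrange(n_workers):
        tasks.put(None)  # one poison pill per worker

def worker(tasks, results):
    # grab a pair, compare, push the result to the output queue
    while True:
        task = tasks.get()
        if task is None:
            results.put(None)
            break
        ii, jj, a, b = task
        results.put((ii, jj, compare(a, b)))

if __name__ == '__main__':
    h5_file = 'data.h5'  # your file, as in the question
    n_workers = 4        # tune to your core count
    tasks = mp.Queue(maxsize=100)  # bounded, so the reader cannot run far ahead
    results = mp.Queue()

    with tb.openFile(h5_file, 'r') as f:
        N = len(f.root.data)

    mp.Process(target=reader, args=(h5_file, tasks, n_workers)).start()
    for _ in xrange(n_workers):
        mp.Process(target=worker, args=(tasks, results)).start()

    # collect until every worker has signalled that it is done
    D = np.empty((N, N))
    finished = 0
    while finished < n_workers:
        item = results.get()
        if item is None:
            finished += 1
        else:
            ii, jj, val = item
            D[ii, jj] = val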

Before choosing an approach: "Know your enemy", i.e., use profiling! Optimizations are only worth the effort if you improve the bottlenecks, so first find out which methods consume your precious CPU time.
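For example, the standard library's cProfile shows whether the time goes into compare, into the HDF5 reads, or elsewhere. A minimal run, where run_comparisons is a hypothetical function wrapping the double loop:

import cProfile
import pstats

# profile the whole pairwise loop and show the ten most expensive calls
cProfile.run('run_comparisons()', 'pairwise.prof')
pstats.Stats('pairwise.prof').sort_stats('cumulative').print_stats(10)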

Your algorithm is O(n^2), which is not good for large data. Don't you see any chance to reduce this, e.g., by applying some logic? That is always the best approach.

Greetings,

Thorsten
