Optimizing this "statistical coincidence" finding algorithm

Question

Goal

The code below takes a vector<vector<float> > of random numbers drawn from a Gaussian distribution and performs the following:

Iterate simultaneously through all n columns of the vector until you encounter the first value exceeding some threshold.
Continue iterating until either a) you encounter a second value exceeding that threshold that comes from a different column than the first found value, or b) you exceed some maximum number of iterations.
In the case of a), continue iterating until either c) you find a third value exceeding the threshold that comes from a different column than both the first and second found values, or d) you exceed some maximum number of iterations from the first found value. In the case of b), start over, except this time start iterating at one row after the first found value.
In the case of c), add one to a counter and jump forward some x rows. In the case of d), start over, except this time start iterating at one row after the first found value.
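
The steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the asker's implementation: the name countCoincidences and the parameters maxIterations and skipRows are hypothetical, a value counts as exceeding the threshold when its absolute value is at least the threshold, the window covers maxIterations rows starting at the row of the first hit, and the jump in case c) is measured from the row that completed the triple.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: counts windows in which three distinct columns
// each produce a value whose magnitude reaches the threshold.
std::size_t countCoincidences(const std::vector<std::vector<float>>& waveform,
                              float threshold,
                              std::size_t maxIterations,
                              std::size_t skipRows)
{
    assert(maxIterations > 0);  // assumption: a zero-row window is meaningless

    const std::size_t nRows = waveform.size();
    std::size_t found = 0;
    std::size_t start = 0;

    while (start < nRows) {
        std::set<std::size_t> columns;  // distinct columns hit in this window
        std::size_t first = nRows;      // row of the first hit; nRows means "none yet"
        bool triple = false;
        std::size_t r = start;

        for (; r < nRows; ++r) {
            // Cases b) and d): the window that opened at `first` has run out.
            if (first != nRows && r - first >= maxIterations) break;

            const std::vector<float>& row = waveform[r];
            for (std::size_t c = 0; c < row.size(); ++c) {
                if (std::abs(row[c]) >= threshold) columns.insert(c);
            }
            if (first == nRows && !columns.empty()) first = r;

            // Case c): three different columns have fired inside the window.
            if (columns.size() >= 3) { triple = true; break; }
        }

        if (triple) {
            ++found;
            start = r + 1 + skipRows;   // skip skipRows rows, then resume
        } else if (first != nRows) {
            start = first + 1;          // restart one row after the first hit
        } else {
            break;                      // no hit anywhere in the remaining rows
        }
    }
    return found;
}
```

For example, with three columns firing on three consecutive rows, a window of 4 rows yields one coincidence and a window of 2 rows yields none.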

How I accomplish this:

In my opinion, the most challenging part is making sure each of the three values comes from a different column. To tackle this, I used std::set. I iterate through each row of the vector<vector<float> >, then through each column of that row. I check each column for a value exceeding the threshold and store its column number in a std::set.

I continue iterating. If I reach max_iterations, I jump back to one row after the first found value, empty the set, and reset the counter. If the std::set has size 3, I add one to the counter.
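
Since the data has only tens of columns, the distinct-column bookkeeping could also be done with a bitmask instead of a std::set, which allocates a node on every new insertion. The following is a minimal sketch of that idea, assuming at most 64 columns; the type name ColumnMask is hypothetical and mirrors only the three std::set operations used here (insert, size, clear).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for std::set<size_t>, valid only for columns 0..63.
struct ColumnMask {
    std::uint64_t bits = 0;   // bit c is set once column c has been seen
    std::size_t count = 0;    // number of distinct columns seen so far

    void insert(std::size_t column) {
        assert(column < 64);  // assumption: no more than 64 columns
        const std::uint64_t bit = std::uint64_t{1} << column;
        if ((bits & bit) == 0) {  // only count a column the first time
            bits |= bit;
            ++count;
        }
    }

    std::size_t size() const { return count; }

    void clear() {
        bits = 0;
        count = 0;
    }
};
```

Inserting the same column twice leaves size() unchanged, so the size-3 check keeps its meaning. Whether this is measurably faster on the real data would need profiling.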

My issue:

This code will need to run on multidimensional vectors with on the order of tens of columns and hundreds of thousands to millions of rows. As of now, that's excruciatingly slow. I'd like to improve performance significantly, if possible.

My code:

void findRate(float thresholdVolts){

    set<size_t> cache;
    vector<size_t> index;

    size_t count = 0, found = 0;

    for(auto rowItr = waveform.begin(); rowItr != waveform.end(); ++rowItr){

        auto &row = *rowItr;

        for(auto colnItr = row.begin(); colnItr != row.end(); ++colnItr){

            auto &cell = *colnItr;

            if(abs(cell/rmsVoltage) >= (thresholdVolts/rmsVoltage)){
                cache.insert(std::distance(row.begin(), colnItr));
                index.push_back(std::distance(row.begin(), colnItr));
            }

        }

        if(cache.size() == 0) count == 0;

        if(cache.size() == 3){

            ++found;
            cache.clear();

            if(std::distance(rowItr, output.end()) > ((4000 - count) + 4E+6)){
                std::advance(rowItr, ((4000 - count) + 4E+6));
            }

        }


    }

}

Answer 1

There is one thing you could do right away, in your inner loop. I understand that rmsVoltage is an external variable that is constant during execution of the function.

    for(auto colnItr = row.begin(); colnItr != row.end(); ++colnItr){

        auto &cell = *colnItr;
      
        // You can remove two divisions here. Division is among the
        // slowest arithmetic instructions on most CPUs.
        //
        //  this:
        //    if(abs(cell/rmsVoltage) >= (thresholdVolts/rmsVoltage)){
        //
        // becomes this (equivalent as long as rmsVoltage > 0):
        if (abs(cell) >= thresholdVolts) {
            cache.insert(std::distance(row.begin(), colnItr));
            index.push_back(std::distance(row.begin(), colnItr));
        }
    }
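
To check that dropping the divisions does not change which cells are selected, the two forms of the comparison can be put side by side. This is a small sketch with hypothetical function names (exceedsScaled, exceedsDirect); it assumes rmsVoltage is strictly positive, since a zero or negative value would make the original comparison behave differently from the simplified one. Values lying exactly on the threshold may also differ in the last bit because of floating-point rounding in the divided form.

```cpp
#include <cassert>
#include <cmath>

// Original form of the test: two divisions per cell.
inline bool exceedsScaled(float cell, float thresholdVolts, float rmsVoltage) {
    assert(rmsVoltage > 0.0f);  // assumption: the RMS voltage is positive
    return std::abs(cell / rmsVoltage) >= (thresholdVolts / rmsVoltage);
}

// Simplified form: the common positive factor has been cancelled.
inline bool exceedsDirect(float cell, float thresholdVolts) {
    return std::abs(cell) >= thresholdVolts;
}
```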

And a bit below: why are you adding a floating-point constant to a size_t? This may cause unnecessary conversions from size_t to double and then back to size_t. Some compilers may handle this, but definitely not all.

These are relatively costly operations.

        // this:
        //  if(std::distance(rowItr, output.end()) > ((4000 - count) + 4E+6)){
        //    std::advance(rowItr, ((4000 - count) + 4E+6));
        //  }

        if (std::distance(rowItr, output.end()) > (4'004'000 - count))
            std::advance(rowItr, 4'004'000 - count);
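
The same skip can be written with integer-only helpers that also make the edge cases explicit. This is a sketch, not the original code: the names skipDistance and advanceIfRoom are hypothetical, and clamping the subtraction at zero is an assumption about the intent, because the unsigned expression 4000 - count wraps around to a huge value once count exceeds 4000.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: rows to skip after a coincidence, in integers only.
inline std::size_t skipDistance(std::size_t count) {
    constexpr std::size_t kWindow = 4000;          // constants taken from the question
    constexpr std::size_t kDeadRows = 4'000'000;
    // Assumption: once count passes kWindow, the remainder is treated as zero
    // instead of letting the unsigned subtraction wrap around.
    return kDeadRows + (count < kWindow ? kWindow - count : 0);
}

// Hypothetical helper: advance `it` by n only if more than n elements remain.
template <typename It>
bool advanceIfRoom(It& it, It end, std::size_t n) {
    const auto remaining = std::distance(it, end);
    assert(remaining >= 0);  // `it` must not be past `end`
    if (static_cast<std::size_t>(remaining) > n) {
        using Diff = typename std::iterator_traits<It>::difference_type;
        std::advance(it, static_cast<Diff>(n));
        return true;
    }
    return false;
}
```

Inside the loop this would read advanceIfRoom(rowItr, waveform.end(), skipDistance(count)), assuming waveform is the container actually being iterated.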

Also, after observing how much memory your function needs, you should preallocate some reasonable space for the index container using vector<>::reserve(). Note that std::set has no reserve() member, so this applies only to index, not to cache.
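
A minimal sketch of the preallocation follows. The helper name prepareIndex and the parameter expectedHits are hypothetical, and the right estimate has to come from measuring the real data. std::vector provides reserve(); std::set does not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: empty the index and reserve room for the expected
// number of hits, so push_back does not reallocate inside the hot loop.
inline void prepareIndex(std::vector<std::size_t>& index, std::size_t expectedHits) {
    index.clear();
    index.reserve(expectedHits);
    assert(index.capacity() >= expectedHits);  // guaranteed by reserve()
}
```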

Did you give us the entire algorithm? The contents of the container index are not used anywhere.

Please let me know how much time you've gained with these changes.

Answer 1 · 0 · 2020-11-13 04:06:51
