Optimizing this "coincidence search" algorithm for speed

Question

I've written an algorithm designed to simulate the data produced by an experiment and then perform a "coincidence search" on that data (more on that in a moment). The data in question is a vector<vector<double> > whose elements are drawn from a Gaussian distribution (more or less, random numbers). Each "column" represents a "data stream", and each row an instant in time. The "location" of each element in the "array" must be preserved.

The Algorithm:

The algorithm is designed to perform the following task:

Iterate simultaneously through all n columns (data streams), and count the number of times at least c unique columns have an element with an absolute value greater than some threshold, such that the elements lie in a specified time interval (i.e., a certain number of rows).

When this occurs, we add one to a counter, and then jump forward in time (row-wise) by some specified amount. We start over again, until we've traversed the entire "array". Finally, we return the value of the counter (the "number of coincidences").
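
The task above can be written down as a slow but simple reference, which is handy for checking a faster version against small inputs. This is only a sketch of one reading of the description: the function name and the parameters window (rows in the time interval) and jump (rows skipped after a hit) are hypothetical, the comparison uses >= as in the code further down, and a window is assumed to begin at a row that has at least one value over the threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <set>
#include <vector>

// Brute-force reference, not the original implementation.
// window and jump are hypothetical names for the time interval and the skip.
std::size_t countCoincidencesReference(
    const std::vector<std::vector<double>>& data, double threshold,
    std::size_t min_columns, std::size_t window, std::size_t jump) {
  assert(window > 0 && jump > 0);
  std::size_t count = 0;
  std::size_t start = 0;
  while (start < data.size()) {
    std::set<std::size_t> columns;  // unique columns seen in this window
    bool found = false;
    const std::size_t stop = std::min(start + window, data.size());
    for (std::size_t i = start; i < stop && !found; ++i) {
      for (std::size_t j = 0; j < data[i].size(); ++j) {
        if (std::fabs(data[i][j]) >= threshold) columns.insert(j);
      }
      // Only reachable with an empty set on the first row of the window:
      // a window has to begin at a row with at least one exceedance.
      if (columns.empty()) break;
      found = columns.size() >= min_columns;
    }
    if (found) {
      ++count;
      start += jump;  // jump forward from the first exceeding row
    } else {
      ++start;        // retry from the next row
    }
  }
  return count;
}
```

It restarts the scan for every candidate row, so it is far too slow for the real data; its only purpose is to pin down the expected counts.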

My solution:

I give the code first, then step through it piece by piece and explain its operation (hopefully also clarifying some details):

size_t numOfCoincidences(vector<vector<double>> array, double value_threshold, size_t num_columns){

    set<size_t> cache;
    size_t coincidence_counter = 0, time_counter = 0;

    auto exceeds_threshold = [&](double element){ return fabs(element) >= value_threshold; };

    for(auto row_itr = begin(array); row_itr != end(array); ++row_itr){

        auto &row = *row_itr;

        auto coln_itr = std::find_if(execution::par_unseq, begin(row), end(row), exceeds_threshold);
        while(coln_itr != row.end()){
            cache.insert(distance(begin(row), coln_itr));
            coln_itr = std::find_if(next(coln_itr), end(row), exceeds_threshold);
        }

        if(size(cache) >= num_columns){

            ++coincidence_counter;
            cache.clear();

            if(distance(row_itr, end(array)) > (4004000 - time_counter)){
                advance(row_itr, ((4004000 - time_counter)));
            } else {
                return coincidence_counter;
            }

        }


        if(time_counter == time_threshold){
            row_itr -= (time_counter + 1);
            cache.clear();
        }


        ++time_counter;


    }

    if(cache.size() == 0) time_counter = 0;

    return(coincidence_counter);

}

How it works:

I iterate through the data ( vector<vector<double> > array ) row-wise:

for(auto row_itr = begin(array); row_itr != end(array); ++row_itr)

For each row, I use std::find_if to get every element exceeding the value threshold ( value_threshold ):

        auto coln_itr = std::find_if(execution::par_unseq, begin(row), end(row), exceeds_threshold);
        while(coln_itr != row.end()){
            cache.insert(distance(begin(row), coln_itr));
            coln_itr = std::find_if(next(coln_itr), end(row), exceeds_threshold);
        }

What I'm after is the column index, so I use std::distance to get that and store it in an std::set, cache. I choose std::set here because I'm interested in counting the number of unique columns that have a value exceeding value_threshold within some time (i.e., row) interval. By using std::set, I can just dump the column index of every such value, and duplicates are "automatically removed". Then, later, I can simply check the size of cache and, if it's greater than or equal to the specified number ( num_columns ), I've found a "coincidence".
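
With only about 16-20 columns, the same "unique columns" bookkeeping could also be sketched with a bitmask instead of a std::set: bit j records that column j has exceeded the threshold, OR-ing masks merges rows, and the number of set bits plays the role of the set's size. The helper names below are hypothetical, and the sketch assumes at most 64 columns.

```cpp
#include <bitset>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: bit j of the result is set when |row[j]| >= value_threshold.
// Assumes the row has at most 64 columns.
std::uint64_t columnsOverThreshold(const std::vector<double>& row,
                                   double value_threshold) {
  assert(row.size() <= 64);
  std::uint64_t mask = 0;
  for (std::size_t j = 0; j < row.size(); ++j) {
    if (std::fabs(row[j]) >= value_threshold) {
      mask |= std::uint64_t{1} << j;
    }
  }
  return mask;
}

// Number of distinct columns recorded in the mask (the analogue of cache.size()).
std::size_t uniqueColumnCount(std::uint64_t mask) {
  return std::bitset<64>(mask).count();
}
```

In this form, cache would be a single std::uint64_t that is updated with cache |= columnsOverThreshold(row, value_threshold) and cleared with cache = 0, so no allocation happens per inserted index. Whether that is measurably faster on the real data is untested here.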

After getting the column index of every value exceeding value_threshold, I check the size of cache to see if I've found enough unique columns. If I have, I add one to coincidence_counter, clear cache, and then jump forward in "time" (i.e., rows) by some specified amount (here, 4004000 - time_counter ). Notice that I subtract time_counter, which keeps track of the "time" (number of rows) elapsed since the first found value exceeding value_threshold. I want to jump forward in time from that starting point.

        if(size(cache) >= num_columns){

            ++coincidence_counter;
            cache.clear();

            if(distance(row_itr, end(array)) > (4004000 - time_counter)){
                advance(row_itr, ((4004000 - time_counter)));
            } else {
                return coincidence_counter;
            }

        }

Finally, I check time_counter. Remember that the num_columns unique columns must be within some time (i.e., row) threshold of one another. I start that time count from the first found value exceeding value_threshold. If I've exceeded the time threshold, I want to empty cache and start over, using the second-found value exceeding the value threshold (if there is one) as the new first-found value, and hopefully find a coincidence using that as the starting point.

Instead of keeping track of the time (i.e., row index) of each found value, I simply start over at one row after the first-found value (i.e., time_counter + 1 ).

        if(time_counter == time_threshold){
            row_itr -= (time_counter + 1);
            cache.clear();
        }

I also add one to time_counter with each loop, and set it equal to 0 if cache has size 0 (I want to start counting time (i.e., rows) from the first-found value exceeding value_threshold ).

Attempted Optimizations:

I'm not sure whether these have helped, hurt, or made no difference, but here's what I've tried (with little success):

I've replaced all int and unsigned int with size_t. I understand that this may be ever so slightly faster, and these values should never be less than 0 anyhow.

I've also used execution::par_unseq with std::find_if. I'm not sure how much this helps. The "array" typically has about 16-20 columns, but an exceptionally large number of rows (on the order of 50000000 or more). Since std::find_if is "scanning" individual rows, which only have tens of elements at most, perhaps parallelization isn't helping much.

Goals:

Unfortunately, the algorithm takes an exceptionally long time to run. My utmost priority is speed. If possible, I'd like to cut the execution time in half.

Some things to keep in mind: The "array" is typically on the order of ~20 columns by ~50000000 rows (sometimes longer). It has very few 0's, and cannot be re-arranged (the order of the "rows", and of the elements in each row, matters). It takes up (unsurprisingly) a ton of memory, and my machine is therefore quite resource constrained.
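
To put a number on "a ton of memory": at 20 columns by 50000000 rows of 8-byte doubles, the element payload alone is 20 x 50000000 x 8 = 8000000000 bytes (about 8 GB), and a vector<vector<double> > adds a separate heap allocation for every row on top of that. The sketch below does the arithmetic and shows one possible layout that keeps row and element order intact while using a single contiguous allocation. FlatArray and payloadBytes are hypothetical names, not something from the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

static_assert(sizeof(double) == 8, "this sketch assumes 8-byte doubles");

// Bytes needed for the elements alone, ignoring any container overhead.
constexpr std::uint64_t payloadBytes(std::uint64_t rows, std::uint64_t cols) {
  return rows * cols * sizeof(double);
}

// Hypothetical row-major container: element (row, col) lives at
// values[row * cols + col], so the order of rows and columns is preserved.
struct FlatArray {
  std::size_t rows;
  std::size_t cols;
  std::vector<double> values;

  FlatArray(std::size_t rows_, std::size_t cols_)
      : rows(rows_), cols(cols_), values(rows_ * cols_, 0.0) {}

  double& at(std::size_t row, std::size_t col) {
    assert(row < rows && col < cols);
    return values[row * cols + col];
  }

  // Pointer to the first element of a row; the row spans cols doubles.
  const double* rowPtr(std::size_t row) const {
    assert(row < rows);
    return values.data() + row * cols;
  }
};
```

A row-wise scan over such a layout walks memory strictly front to back. Whether it helps in practice depends on how the data is generated and on available memory, which this sketch does not address.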

I'm also running this as interpreted C++, in cling. In my work, I've never used compiled C++ much. I've tried compiling, however it hasn't helped much. I've also tried playing with compiler optimization flags.

What can be done to cut execution time (at the expense of virtually anything else)?

Please let me know if I can offer any additional information to assist in answering the question.

Answer 1

This code seems like it might be memory-bandwidth bound regardless, but I'd try removing the fancy algorithm stuff in favor of a windowed count. Untested C++:

#include <algorithm>
#include <cmath>
#include <vector>

using std::fabs;
using std::size_t;
using std::vector;

size_t NumCoincidences(const vector<vector<double>> &array,
                       double value_threshold, size_t num_columns) {
  static constexpr size_t kWindowSize = 4004000;
  const auto exceeds_threshold = [&](double x) {
    return fabs(x) >= value_threshold;
  };
  size_t start = 0;
  std::vector<size_t> num_exceeds_in_window(array[0].size());
  size_t num_coincidences = 0;
  for (size_t i = 0; i < array.size(); i++) {
    const auto &row = array[i];
    for (size_t j = 0; j < row.size(); j++) {
      num_exceeds_in_window[j] += exceeds_threshold(row[j]) ? 1 : 0;
    }
    if (i >= start + kWindowSize) {
      const auto &row = array[i - kWindowSize];
      for (size_t j = 0; j < row.size(); j++) {
        num_exceeds_in_window[j] -= exceeds_threshold(row[j]) ? 1 : 0;
      }
    }
    size_t total_exceeds_in_window = 0;
    for (size_t n : num_exceeds_in_window) {
      total_exceeds_in_window += n > 0 ? 1 : 0;
    }
    if (total_exceeds_in_window >= num_columns) {
      start = i + 1;
      std::fill(num_exceeds_in_window.begin(), num_exceeds_in_window.end(), 0);
      num_coincidences++;
    }
  }
  return num_coincidences;
}
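
To try the windowed count on a small input, the same idea can be sketched with the window length passed in rather than fixed at 4004000. This is an adaptation for experimentation, not the code above verbatim: the name NumCoincidencesWindowed and the window_size parameter are hypothetical, and a guard for empty input has been added.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sliding-window count with a caller-supplied window length (hypothetical
// parameter). in_window[j] is the number of rows in the current window where
// column j exceeds the threshold.
std::size_t NumCoincidencesWindowed(
    const std::vector<std::vector<double>>& array, double value_threshold,
    std::size_t num_columns, std::size_t window_size) {
  assert(window_size > 0);
  if (array.empty()) return 0;
  const auto exceeds = [&](double x) {
    return std::fabs(x) >= value_threshold;
  };
  std::vector<std::size_t> in_window(array[0].size(), 0);
  std::size_t start = 0;
  std::size_t num_coincidences = 0;
  for (std::size_t i = 0; i < array.size(); ++i) {
    const auto& row = array[i];
    assert(row.size() == in_window.size());
    for (std::size_t j = 0; j < row.size(); ++j) {
      in_window[j] += exceeds(row[j]) ? 1 : 0;
    }
    // Drop the row that has just slid out of the window.
    if (i >= start + window_size) {
      const auto& old_row = array[i - window_size];
      for (std::size_t j = 0; j < old_row.size(); ++j) {
        in_window[j] -= exceeds(old_row[j]) ? 1 : 0;
      }
    }
    const auto active = static_cast<std::size_t>(
        std::count_if(in_window.begin(), in_window.end(),
                      [](std::size_t n) { return n > 0; }));
    if (active >= num_columns) {
      // Coincidence found: restart the window after this row.
      start = i + 1;
      std::fill(in_window.begin(), in_window.end(), 0);
      ++num_coincidences;
    }
  }
  return num_coincidences;
}
```

Each row is read at most twice (once entering the window, once leaving it), so the work per row stays proportional to the number of columns regardless of the window length. Note that this variant restarts right after a coincidence and does not reproduce the question's forward jump.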

Answer 1, 2021-01-07 15:09:45