
What is the fastest way to search for a sequence of numbers in a 2D vector?

Given a 2D array of integer values (the array can be larger than 10k*10k), what is the fastest way to search for a given sequence of numbers in it?

Assume the 2D array is read from a file into one big 1D vector and is accessed as big_matrix[row * width + x]. There are 3 types of searches I would like to do on the same 2D array: Search Ordered, Search Unordered, and Search Best Match. Here's my approach to each of the search functions.

Search Ordered: This function finds all the rows in which the given number sequence is present (the order of the numbers matters). Here's the KMP-based method I implemented to find the sequence:

// lps is the precomputed KMP lookup table for pattern; pattern must not be empty.
void searchPattern(std::vector<int> const &pattern, std::vector<int> const &lps,
                   std::vector<int> const &big_matrix, int begin, int finish,
                   int width, std::vector<int> &searchResult) {

    auto M = (int) pattern.size();
    auto N = width; // size of one row

    while (begin < finish) {
        int i = 0; // position in the current row
        int j = 0; // position in the pattern
        while (i < N) {
            if (pattern[j] == big_matrix[(begin * width) + i]) {
                j++;
                i++;
            }
            if (j == M) {
                searchResult[begin] = begin; // row contains the pattern
                begin++;
                break;
            } else if (i < N && pattern[j] != big_matrix[(begin * width) + i]) {
                if (j != 0)
                    j = lps[j - 1]; // lookup table as in KMP
                else
                    i = i + 1;
            }
        }
        if (j != M) {
            searchResult[begin] = -1; // row does not contain the pattern
            begin++;
        }
    }
}

Complexity: O(m*n); m is the number of rows, n is the number of cols
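
The function above reads lps[j - 1] but does not show how that table is built. A minimal sketch of the usual KMP failure-table construction follows; the name buildLps is hypothetical and not part of the original code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: builds the KMP failure table for pattern.
// lps[k] is the length of the longest proper prefix of pattern[0..k]
// that is also a suffix of pattern[0..k].
std::vector<int> buildLps(std::vector<int> const &pattern) {
    std::vector<int> lps(pattern.size(), 0);
    int len = 0; // length of the prefix matched so far
    for (std::size_t k = 1; k < pattern.size();) {
        if (pattern[k] == pattern[len]) {
            lps[k++] = ++len;   // extend the current match
        } else if (len != 0) {
            len = lps[len - 1]; // fall back and retry the same k
        } else {
            lps[k++] = 0;       // no prefix matches here
        }
    }
    return lps;
}
```

The table depends only on the pattern, so it can be built once per query in O(l) time and reused for every row.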

Search Unordered/Search Best Match: This function finds all the rows in which the given number sequence is present (the order of the numbers doesn't matter). Here I sort each row of the large array in advance and sort only the input array during the search.

// match must already be sorted; every row of big_matrix_sorted is sorted in advance.
void searchUnordered(std::vector<int> const &match, std::vector<int> const &big_matrix_sorted, int begin, int finish,
                     int width, std::vector<int> &searchResult) {
    std::vector<int> v(match.size() + width); // scratch buffer for the intersection
    while (begin < finish) {
        auto it = std::set_intersection(match.begin(), match.end(), big_matrix_sorted.begin() + begin * width,
                                        big_matrix_sorted.begin() + begin * width + width, v.begin());
        // Count the shared values without shrinking v, so the buffer stays large enough for the next row.
        auto common = (std::size_t) (it - v.begin());
        if (common == match.size())
            searchResult[begin] = begin;
        else
            searchResult[begin] = -1;
        begin++;
        /* For search best match the last few lines will change as follows:
      searchResult[begin] = (int) common;
      begin++; and largest in searchResult will be the result */
    }
}

Complexity: O(m*(l + n)); l - the length of the pattern, m is the number of rows, n is the number of cols.
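
To make the Best Match variant concrete, here is a small sketch that counts the shared values per row and keeps the row with the largest count. The function name bestMatchRow and the tie-breaking rule (first row wins) are assumptions for illustration; as above, it assumes the query and every row are already sorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: returns the index of the row in [begin, finish) that
// shares the most values with sortedMatch, or -1 if the range is empty.
// Assumes sortedMatch and each row of sortedMatrix are sorted ascending.
int bestMatchRow(std::vector<int> const &sortedMatch, std::vector<int> const &sortedMatrix,
                 int begin, int finish, int width) {
    std::vector<int> common; // reused scratch buffer
    int bestRow = -1;
    std::size_t bestCount = 0;
    for (int row = begin; row < finish; ++row) {
        auto rowBegin = sortedMatrix.begin() + static_cast<std::ptrdiff_t>(row) * width;
        common.clear();
        std::set_intersection(sortedMatch.begin(), sortedMatch.end(), rowBegin, rowBegin + width,
                              std::back_inserter(common));
        // Strictly greater, so the earliest row wins a tie.
        if (bestRow == -1 || common.size() > bestCount) {
            bestRow = row;
            bestCount = common.size();
        }
    }
    return bestRow;
}
```

This keeps the same O(m*(l + n)) cost as the unordered search; it only changes what is recorded per row.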

Preprocessing of big_matrix (constructing the lookup table, storing a sorted version of it) is not taken into consideration; any preprocessing is allowed. How can I improve the complexity of these search functions (to O(log(m*n)))?

If you already have the right algorithm and just want it to run faster overall, you may get some performance by optimising the code (memory allocations, removing duplicate operations if the compiler didn't, etc.). For example, there may be a gain from reading big_matrix[(row * width) + i] once into a local variable instead of evaluating it twice. Be careful to profile and measure realistic cases.

For bigger gains, threads can be an option. Each row can be processed independently here, so the speedup should be roughly linear in the number of cores. C++11 has std::async, which handles some of the work of launching threads and collecting results, so you don't have to deal with std::thread or platform-specific mechanisms yourself. Newer versions of C++ add other facilities that may be useful as well.

void searchPatternRow(std::vector<int> const &pattern, std::vector<int> const &big_matrix, int row, int width, std::vector<int> &searchResult);
void searchPattern(std::vector<int> const &pattern, std::vector<int> const &big_matrix, int begin, int finish, int width, std::vector<int> &searchResult)
{
    std::vector<std::future<void>> futures;
    for (int row = begin; row < finish; ++row)
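        // Keep each future: a discarded std::async future blocks in its destructor, which would serialize the rows.
        futures.push_back(std::async(std::launch::async, [&, row]() { searchPatternRow(pattern, big_matrix, row, width, searchResult);  }));
    for (auto &future : futures) future.wait(); // Note, also implicit when the future from async gets destructed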
}

To improve threaded efficiency you may want to batch the work and search, say, 10 rows per task. There are also some considerations with threads writing to the same cache line of searchResult (false sharing).
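
As a rough sketch of the batching idea, the helper below hands each task a contiguous block of rows instead of a single row. The name forEachRowBatched and the rowFn callback are hypothetical; rowFn stands in for whatever per-row search is used, and batchSize is assumed to be positive.

```cpp
#include <algorithm>
#include <functional>
#include <future>
#include <vector>

// Hypothetical helper: calls rowFn(row) for every row in [begin, finish),
// giving each asynchronous task a contiguous batch of batchSize rows.
void forEachRowBatched(int begin, int finish, int batchSize,
                       std::function<void(int)> const &rowFn) {
    std::vector<std::future<void>> futures;
    for (int first = begin; first < finish; first += batchSize) {
        int last = std::min(first + batchSize, finish);
        // One task per batch; the futures are stored so the tasks can overlap.
        futures.push_back(std::async(std::launch::async, [first, last, &rowFn]() {
            for (int row = first; row < last; ++row) rowFn(row);
        }));
    }
    for (auto &future : futures) future.get(); // waits and rethrows any exception
}
```

Because each task writes a contiguous range of results, neighbouring threads only share a cache line at batch boundaries. Choosing batchSize so that the number of batches is close to the number of cores avoids creating far more threads than the machine can run.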

When searching for an exact match, you can do this quite efficiently by using what I will call a "moving hash" (commonly known as a rolling hash).

When you search, you calculate a hash of your search string, and at the same time you keep calculating a moving hash over the data you are searching. When comparing, you first compare the hashes, and only if they match do you go on to compare the actual data.

Now the trick is to choose a hash algorithm that can easily be updated each time you move one position, instead of recalculating everything. An example of such a hash is the sum of all the digits.

If I have the following array: 012345678901234567890 and I want to find 34567 in it, I could define the hash as the sum of all the digits in the search string. This would give a hash of 25 (3+4+5+6+7). I would then search through the array and keep updating a running hash on the array. The first hash in the array would be 10 (0+1+2+3+4) and the second hash would be 15 (1+2+3+4+5). But instead of recalculating the second hash, I can just update the previous hash by adding 5 (the new digit) and subtracting 0 (the old digit).

As updating the "running hash" is O(1), you can speed up the process considerably if you have a good hash algorithm that doesn't give many false hits. The simple sum I use as a hash is probably too simple, but other methods also allow this kind of updating, e.g. XOR.
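
A minimal sketch of the running-sum idea, assuming the data is a flat vector of integers; the function name findWithRollingSum is hypothetical. Every hash hit is verified against the actual data, so false hits cost time but never produce wrong results.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the start index of every occurrence of pattern
// in data, using the window sum as a rolling hash and verifying each hash hit.
std::vector<std::size_t> findWithRollingSum(std::vector<int> const &data,
                                            std::vector<int> const &pattern) {
    std::vector<std::size_t> hits;
    std::size_t const m = pattern.size();
    if (m == 0 || data.size() < m) return hits;

    long long target = 0; // hash of the pattern
    long long window = 0; // hash of the current window of data
    for (std::size_t k = 0; k < m; ++k) {
        target += pattern[k];
        window += data[k];
    }
    for (std::size_t start = 0;; ++start) {
        // Compare the actual data only when the hashes agree.
        if (window == target &&
            std::equal(pattern.begin(), pattern.end(),
                       data.begin() + static_cast<std::ptrdiff_t>(start)))
            hits.push_back(start);
        if (start + m == data.size()) break;
        window += data[start + m] - data[start]; // add the new value, drop the old one
    }
    return hits;
}
```

With the example above, the windows starting at indices 3 and 13 match, while the window 78901 at index 7 also sums to 25 and is rejected by the comparison step, which illustrates why a plain sum gives false hits.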
