
Ultra fast way to search for an integer substring

I need to find an integer substring within a range of extremely large strings. I thought of using a vector of vectors to store the range of integer strings, and similarly I store the integer string to be searched for in a vector. Example below:

//vector of 5 vectors
std::vector<std::vector<int>> vec(5);
// elements= {10,5,8,23,15,32,12,34,56,55,43,12,33,4}

and the substring goes into a vector:

//vector with integer substring
std::vector<int> vec1;
//elements = {5,8,23}

and I use std::search over the vector of vectors to find the substring vector, something like this:

for (std::size_t i = 0; i < vec.size(); i++) // search each vector in turn
{
   auto pos = std::search(vec[i].begin(), vec[i].end(), vec1.begin(), vec1.end());
   // some more code
}
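
For reference, the loop above can be fleshed out into a small self-contained baseline. This is only a sketch: the `Hit` struct and the `findInAll` helper are illustrative names, not part of the original post. It reports every (vector, offset) pair at which the pattern occurs by calling `std::search` repeatedly, each time starting just past the previous match.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative type (not from the original post): one match location.
struct Hit {
    std::size_t whichVector;  // index into the outer vector
    std::size_t offset;       // start position inside that vector
};

// Baseline: scan every vector with std::search and collect all matches.
std::vector<Hit> findInAll(const std::vector<std::vector<int>>& vec,
                           const std::vector<int>& pattern)
{
    std::vector<Hit> hits;
    if (pattern.empty()) return hits;  // assumption: an empty pattern matches nothing

    for (std::size_t i = 0; i < vec.size(); ++i) {
        auto from = vec[i].begin();
        while (true) {
            auto pos = std::search(from, vec[i].end(), pattern.begin(), pattern.end());
            if (pos == vec[i].end()) break;  // no further match in this vector
            hits.push_back({i, static_cast<std::size_t>(pos - vec[i].begin())});
            from = pos + 1;  // resume one element later, so overlapping matches are found
        }
    }
    return hits;
}
```

Every call rescans the whole vector, so the cost grows with the number of patterns times the total data size; that is the cost the question is trying to avoid.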

In testing, it took about 1m to search for 1000 strings in a range of 10 vectors, each of length 500000.

There are some ultra-fast data structures, such as unordered_map, but I doubt they suit my data. I would appreciate any suggestions or links to a container or data structure that is efficient in terms of both time and space.

Note:

1) There is no possibility of sorting the data, as I would lose the data representation by sorting.

2) I am not searching for individual items, but for substrings of integers.

Edit

The real strings may be 100000000 elements long in each vector, and there may be 1 million substrings, each of length 100.

Here's my attempt at a fast solution: on my 2.7GHz Mac mini it is able to find the locations of the 1000 "substrings" in 1357 milliseconds. It does this by first building up an index of all the locations where each integer appears in the big vectors, so that for each substring it doesn't have to search everywhere, but only at locations where that substring might actually start. One caveat is that the index takes up quite a bit of extra RAM and takes some time to build, so this may or may not be a practical solution, depending on your use case. (Note, though, that it only has to be built once, unless/until you move on to searching a different set of big vectors.)
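
Before the full program, here is the core idea reduced to a single vector. This is a minimal sketch, not the code that was timed: `buildPositionIndex` and `findWithIndex` are hypothetical helper names, and it looks keys up with `find()` (rather than `operator[]`) so that querying a value that never occurs does not insert an empty entry.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: map each value to every position where it occurs.
std::unordered_map<int, std::vector<std::size_t>>
buildPositionIndex(const std::vector<int>& data)
{
    std::unordered_map<int, std::vector<std::size_t>> index;
    for (std::size_t i = 0; i < data.size(); ++i) index[data[i]].push_back(i);
    return index;
}

// Hypothetical helper: compare only at positions where pattern[0] occurs.
std::vector<std::size_t>
findWithIndex(const std::vector<int>& data,
              const std::unordered_map<int, std::vector<std::size_t>>& index,
              const std::vector<int>& pattern)
{
    std::vector<std::size_t> matches;
    if (pattern.empty()) return matches;  // assumption: an empty pattern matches nothing

    auto it = index.find(pattern[0]);
    if (it == index.end()) return matches;  // first value never occurs

    for (std::size_t start : it->second) {
        if (start + pattern.size() > data.size()) continue;  // pattern would run off the end
        if (std::equal(pattern.begin(), pattern.end(),
                       data.begin() + static_cast<std::ptrdiff_t>(start)))
            matches.push_back(start);
    }
    return matches;
}
```

With the question's example data, searching for {5,8,23} only inspects the single position where 5 occurs instead of scanning all 14 elements. How much this helps depends on how many distinct values the data contains: the fewer distinct values, the more candidate positions each lookup returns.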

#include <algorithm>
#include <vector>
#include <cstdlib>   // rand()
#include <cstdint>
#include <chrono>
#include <iostream>
#include <unordered_map>

using namespace std;

// Store a vector index and an offset into the vector efficiently
// Supports up to 256 vectors and offsets up to 16777215 (24 bits)
static inline uint32_t GetVectorLocationKey(uint8_t whichVector, uint32_t offsetIntoVector)
{
   return ((((uint32_t)whichVector)<<24)|offsetIntoVector);
}

static inline void GetVectorLocationFromKey(uint32_t key, uint8_t & retWhichVector, uint32_t & retOffsetIntoVector)
{
   retWhichVector = (key >> 24) & 0xFF;
   retOffsetIntoVector = (key & 0xFFFFFF);
}

static inline bool SubstringExistsAtOffset(const int * bigVector, const vector<int> & substring)
{
   const int * smallVector = &substring[0];
   const size_t subLen = substring.size();
   for (size_t i=0; i<subLen; i++) if (bigVector[i] != smallVector[i]) return false;
   return true;
}

int main(int, char **)
{
   // Create some large vectors to search in
   vector<vector<int> > big_vectors;
   const size_t num_big_vectors = 5;
   const size_t big_vector_size = 500000;
   for (size_t i=0; i<num_big_vectors; i++)
   {
      big_vectors.push_back(vector<int>());
      vector<int> & v = big_vectors.back();
      for (size_t j=0; j<big_vector_size; j++) v.push_back(rand()%100);
   }

   // Pick out some small "substring" vectors to search for within the large vectors
   vector<vector<int> > substrings;
   const size_t num_substrings = 1000;
   const size_t substring_size = 14;
   for (size_t i=0; i<num_substrings; i++)
   {
      substrings.push_back(vector<int>());
      size_t whichBigVector = rand()%num_big_vectors;
      size_t offsetIntoVector = rand()%(big_vector_size-substring_size);
      vector<int> & v = substrings.back();
      const vector<int> & bigVector = big_vectors[whichBigVector];
      for (size_t j=0; j<substring_size; j++) v.push_back(bigVector[offsetIntoVector+j]);
   }

   // Now we'll build up a map so that for any given integer we'll
   // have immediate access to a list of the locations it is at.
   // That way we can jump immediately to those locations rather than
   // having to scan through the entire set of big_vectors
   unordered_map<int, vector<uint32_t> > index;
   for (size_t i=0; i<big_vectors.size(); i++)
   {
      const vector<int> & bigVector = big_vectors[i];
      // Index every offset at which a full substring still fits (including the last one)
      for (size_t j=0; j+substring_size<=bigVector.size(); j++)
      {
         int val = bigVector[j];
         index[val].push_back(GetVectorLocationKey(i, j));
      }
   }

   // Now for the time-critical part:  Let's see how fast we
   // can find our substrings within the larger vectors!
   std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
   vector<vector<uint32_t> > results;
   for (size_t i=0; i<substrings.size(); i++)
   {
      results.push_back(vector<uint32_t>());
      vector<uint32_t> & resultVec = results.back();

      const vector<int> & substring = substrings[i];
      const int firstVal = substring[0];
      const vector<uint32_t> & lookup = index[firstVal];
      for (size_t j=0; j<lookup.size(); j++)
      {
         const uint32_t key = lookup[j];
         uint8_t whichVector;
         uint32_t offsetIntoVector;
         GetVectorLocationFromKey(key, whichVector, offsetIntoVector);

         const vector<int> & bigVector = big_vectors[whichVector];
         if (SubstringExistsAtOffset(&bigVector[offsetIntoVector], substring)) resultVec.push_back(key);
      }
   }
   std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

   cout << " Total time spent finding " << substrings.size() << " substrings was " << std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count() << " milliseconds." << std::endl;

   cout << endl << endl << "RESULTS:" << endl;
   for(size_t i=0; i<results.size(); i++)
   {
      const vector<uint32_t> & result = results[i];
      for (size_t j=0; j<result.size(); j++)
      {
         const uint32_t key = result[j];
         uint8_t whichVector;
         uint32_t offsetIntoVector;
         GetVectorLocationFromKey(key, whichVector, offsetIntoVector);

         cout << "An instance of substring #" << i << " was found in bigVector #" << (int)whichVector << " at offset " << offsetIntoVector << endl;

         // Let's just double-check that the substring actually exists where I said it did
         // It would be embarrassing to find out I'm not actually finding them correctly :P
         const vector<int> & bigVector = big_vectors[whichVector];
         const vector<int> & substring = substrings[i];
         for (size_t k=0; k<substring.size(); k++)
         {
            if (bigVector[offsetIntoVector+k] != substring[k]) cout << "ERROR BAD RESULT in substring #" << i << " at offset " << k << endl;
         }
      }
   }
}
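
One limitation of the program above: its 32-bit location key holds only a 24-bit offset, so it cannot address vectors of 100000000 elements as mentioned in the question's edit. A possible workaround, sketched here under the assumption that doubling the key size is acceptable, is a 64-bit key. The names `GetVectorLocationKey64` and `GetVectorLocationFromKey64` are hypothetical and not part of the answer's code.

```cpp
#include <cstdint>

// Hypothetical 64-bit variant of GetVectorLocationKey:
// upper 32 bits = vector index, lower 32 bits = offset into that vector.
static inline std::uint64_t GetVectorLocationKey64(std::uint32_t whichVector, std::uint32_t offsetIntoVector)
{
   return (static_cast<std::uint64_t>(whichVector) << 32) | offsetIntoVector;
}

// Inverse of GetVectorLocationKey64: split a key back into its two parts.
static inline void GetVectorLocationFromKey64(std::uint64_t key, std::uint32_t & retWhichVector, std::uint32_t & retOffsetIntoVector)
{
   retWhichVector      = static_cast<std::uint32_t>(key >> 32);
   retOffsetIntoVector = static_cast<std::uint32_t>(key & 0xFFFFFFFFu);
}
```

Using it would mean changing the index type to `unordered_map<int, vector<uint64_t> >`. Each index entry then takes 8 bytes instead of 4, so indexing one vector of 100000000 elements needs roughly 800 MB for the keys alone, which makes the RAM caveat above even more relevant at that scale.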


 