
Which data structure and algorithm is appropriate for this?

I have thousands of strings. Given a pattern, I need to search all of the strings and return every string that contains the pattern.

At present I store the original strings in a vector. I search each one for the pattern, add every match to a new vector, and finally return that vector.

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector <string> v;
    v.push_back ("maggi");
    v.push_back ("Active Baby Pants Large 9-14 Kg ");
    v.push_back ("Premium Kachi Ghani Pure Mustard Oil ");
    v.push_back ("maggi soup");
    v.push_back ("maggi sauce");
    v.push_back ("Superlite Advanced Jar");
    v.push_back ("Superlite Advanced");
    v.push_back ("Goldlite Advanced");
    v.push_back ("Active Losorb Oil Jar");

    vector <string> result;

    string str = "Advanced";

    // Collect every string that contains the pattern
    for (unsigned i=0; i<v.size(); ++i)
    {
        size_t found = v[i].find(str);
        if (found!=string::npos)
            result.push_back(v[i]);
    }

    // Print the matches
    for (unsigned j=0; j<result.size(); ++j)
    {
        cout << result[j] << endl;
    }
    return 0;
}

Is there a better way to achieve the same result with lower complexity and higher performance?

I think the containers you are using are appropriate for your application.

However, if you implement your own KMP algorithm instead of using std::string::find, you can guarantee that each search takes time linear in the length of the string plus the length of the search pattern.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

The complexity of std::string::find, by contrast, is unspecified.
http://www.cplusplus.com/reference/string/string/find/

EDIT: As pointed out in the question "C++ string::find complexity", if your strings are not long (no more than about 1000 characters), then std::string::find is probably good enough, since the table-building step of KMP is not needed at that size.
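For reference, a hand-rolled KMP filter could look roughly like the sketch below. The function names (build_failure_table, kmp_contains, kmp_filter) are made up for illustration and are not part of any library. The failure table is built once per pattern and reused for every string, so each string is scanned in time linear in its length.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds the KMP failure table: fail[i] is the length of the longest proper
// prefix of pattern[0..i] that is also a suffix of pattern[0..i].
std::vector<std::size_t> build_failure_table(const std::string& pattern) {
    std::vector<std::size_t> fail(pattern.size(), 0);
    std::size_t k = 0;  // length of the currently matched prefix
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Returns true if pattern occurs somewhere in text.
// fail must be the table built from the same pattern.
bool kmp_contains(const std::string& text, const std::string& pattern,
                  const std::vector<std::size_t>& fail) {
    if (pattern.empty()) return true;  // same convention as std::string::find
    std::size_t k = 0;  // number of pattern chars matched so far
    for (std::size_t i = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == pattern.size()) return true;
    }
    return false;
}

// Hypothetical helper: returns copies of all strings that contain pattern.
std::vector<std::string> kmp_filter(const std::vector<std::string>& strings,
                                    const std::string& pattern) {
    const std::vector<std::size_t> fail = build_failure_table(pattern);
    std::vector<std::string> result;
    for (const std::string& s : strings) {
        if (kmp_contains(s, pattern, fail)) result.push_back(s);
    }
    return result;
}
```

Calling kmp_filter(v, "Advanced") on the vector from the question should select the same strings as the std::string::find loop. Whether it is actually faster on short product names like these is something you would have to measure.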

If the result is used in the same block of code as the input string vector (as in your example), or if you otherwise have a guarantee that the result is only used while the input exists, you don't actually need to copy the strings. Copying can be an expensive operation that considerably slows down the whole algorithm.

Instead you could have a vector of pointers as the result:

vector <string*> result;
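A minimal sketch of that idea, assuming the input vector is not modified while the result is in use (adding or removing elements can invalidate the stored pointers). The function name find_matches is only illustrative, and const pointers are used here because the matches are only read:

```cpp
#include <string>
#include <vector>

// Returns pointers to the matching strings instead of copies.
// The pointers refer to elements of `strings`, so they are only valid
// while that vector exists and is left unchanged.
std::vector<const std::string*> find_matches(
        const std::vector<std::string>& strings,
        const std::string& pattern) {
    std::vector<const std::string*> result;
    for (const std::string& s : strings) {
        if (s.find(pattern) != std::string::npos) {
            result.push_back(&s);
        }
    }
    return result;
}
```

Each match is then read through the pointer, for example cout << *result[j] << endl; in the printing loop.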

If the list of strings is "fixed" for many searches then you can do some simple preprocessing to speed up things quite considerably by using an inverted index.

Build a map of all chars present in the strings, in other words for each possible char store a list of all strings containing that char:

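#include <map>
#include <string>
#include <vector>

std::map< char, std::vector<int> > index;
std::vector<std::string> strings;

void add_string(const std::string& s) {
    int new_pos = strings.size();
    strings.push_back(s);
    for (int i=0,n=s.size(); i<n; i++) {
        std::vector<int>& ix = index[s[i]];
        // Record each string at most once per char, even if the char repeats
        if (ix.empty() || ix.back() != new_pos) {
            ix.push_back(new_pos);
        }
    }
}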

Then, when asked to search for a substring, you look up every char of the substring in the inverted index and iterate only over the list with the smallest number of entries:

std::vector<std::string *> matching(const std::string& text) {
    std::vector<int> *best_ix = NULL;
    for (int i=0,n=text.size(); i<n; i++) {
        std::vector<int> *ix = &index[text[i]];
        if (best_ix == NULL || best_ix->size() > ix->size()) {
            best_ix = ix;
        }
    }

    std::vector<std::string *> result;
    if (best_ix) {
        for (int i=0,n=best_ix->size(); i<n; i++) {
            std::string& cand = strings[(*best_ix)[i]];
            if (cand.find(text) != std::string::npos) {
                result.push_back(&cand);
            }
        }
    } else {
        // Empty text as input, just return the whole list
        for (int i=0,n=strings.size(); i<n; i++) {
            result.push_back(&strings[i]);
        }
    }
    return result;
}

Many improvements are possible:

  • use a bigger index (eg using pairs of consecutive chars)
  • avoid considering very common chars (stop lists)
  • use hashes computed from triplets or longer sequences
  • search the intersection of the lists instead of searching only the shortest list. Since the elements are added in order, the vectors are already sorted, so the intersection can be computed efficiently even with vectors (see std::set_intersection ).

All of them may make sense or not depending on the parameters of the problem (how many strings, how long, how long is the text being searched ...).
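As an illustration of the last improvement, here is a rough sketch that intersects the per-char lists with std::set_intersection before running the real substring check. The CharIndex struct and its member names are invented for this example. It assumes each list holds every string id at most once, in ascending order, which add() takes care of.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

struct CharIndex {
    std::vector<std::string> strings;
    std::map<char, std::vector<int> > postings;  // char -> sorted, unique ids

    void add(const std::string& s) {
        const int id = static_cast<int>(strings.size());
        strings.push_back(s);
        for (char c : s) {
            std::vector<int>& list = postings[c];
            // ids arrive in increasing order, so checking back() is enough
            if (list.empty() || list.back() != id) list.push_back(id);
        }
    }

    // Returns the ids of all strings that contain text as a substring.
    std::vector<int> search(const std::string& text) const {
        // Start with every id, then narrow down one char at a time.
        std::vector<int> candidates;
        for (std::size_t i = 0; i < strings.size(); ++i) {
            candidates.push_back(static_cast<int>(i));
        }
        for (char c : text) {
            std::map<char, std::vector<int> >::const_iterator it = postings.find(c);
            if (it == postings.end()) return std::vector<int>();  // char never seen
            std::vector<int> narrowed;
            std::set_intersection(candidates.begin(), candidates.end(),
                                  it->second.begin(), it->second.end(),
                                  std::back_inserter(narrowed));
            candidates.swap(narrowed);
        }
        // The index only proves that all chars are present; confirm the substring.
        std::vector<int> result;
        for (int id : candidates) {
            if (strings[id].find(text) != std::string::npos) result.push_back(id);
        }
        return result;
    }
};
```

Intersecting costs extra work per query, so it only pays off when it removes many candidates; for short queries made of common chars, the shortest-list approach may well be faster.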

If the source text is large and static (e.g. crawled webpages), then you can save search time by pre-building a suffix tree or a trie data structure. The search pattern can then be used to traverse the tree to find matches.
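A very simple version of this idea is a trie that holds every suffix of every string, where each node remembers which strings pass through it. The sketch below is an uncompressed suffix trie, not a real suffix tree, and the SuffixTrie class is invented for illustration. Its memory use grows roughly with the sum of the squared string lengths, so it only suits short strings such as the product names in the question; a compressed suffix tree or a suffix array would be needed for large texts.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

class SuffixTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node> > children;
        std::vector<int> ids;  // strings containing the substring spelled by the path
    };
    Node root_;
    std::vector<std::string> strings_;

public:
    // Inserts all suffixes of s; returns the id assigned to s.
    int add(const std::string& s) {
        const int id = static_cast<int>(strings_.size());
        strings_.push_back(s);
        root_.ids.push_back(id);  // the empty pattern matches every string
        for (std::size_t start = 0; start < s.size(); ++start) {
            Node* node = &root_;
            for (std::size_t i = start; i < s.size(); ++i) {
                std::unique_ptr<Node>& child = node->children[s[i]];
                if (!child) child.reset(new Node());
                node = child.get();
                // ids are added in increasing order, so back() detects repeats
                if (node->ids.empty() || node->ids.back() != id) {
                    node->ids.push_back(id);
                }
            }
        }
        return id;
    }

    // Walks the pattern down the trie; the node reached lists all matches.
    std::vector<std::string> search(const std::string& pattern) const {
        const Node* node = &root_;
        for (char c : pattern) {
            std::map<char, std::unique_ptr<Node> >::const_iterator it =
                node->children.find(c);
            if (it == node->children.end()) return std::vector<std::string>();
            node = it->second.get();
        }
        std::vector<std::string> result;
        for (int id : node->ids) result.push_back(strings_[id]);
        return result;
    }
};
```

After the trie is built, a query takes time proportional to the pattern length plus the number of matches, regardless of how many strings were indexed.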

If the source text is small and changes frequently, then your original approach is appropriate. The STL functions are generally very well optimized and have stood the test of time.
