
Fast string search?

I have a vector of strings and have to check whether each element in the vector is present in a given list of 5000 words. Besides the mundane method of two nested loops, is there any faster way to do this in C++?

You should put the list of strings into a std::set. It's a data structure optimized for searching. Finding out whether a given element is in the set takes a logarithmic number of comparisons, which is much faster than iterating over all entries.

If you are already using C++11, you can also use std::unordered_set, which is usually even faster for lookups because it is implemented as a hash table.

If this is for school or university, be prepared to explain how these data structures manage to be faster. When your instructor asks you to explain why you used them, "some guys on the internet told me" is unlikely to earn you a sticker in the class book.
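A minimal sketch of the std::set suggestion, assuming the word list is available as a vector of strings. The function name contains_each and its parameter names are made up for illustration and are not from the question.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns one flag per input string: true if that string occurs in `words`.
// `words` stands in for the 5000-word list (hypothetical parameter name).
std::vector<bool> contains_each(const std::vector<std::string>& inputs,
                                const std::vector<std::string>& words) {
    // Build the set once; std::set keeps its elements in sorted order.
    const std::set<std::string> dictionary(words.begin(), words.end());

    std::vector<bool> found;
    found.reserve(inputs.size());
    for (const std::string& s : inputs) {
        // Each lookup needs a logarithmic number of string comparisons
        // instead of a scan over the whole word list.
        found.push_back(dictionary.count(s) > 0);
    }
    return found;
}
```

Calling contains_each with inputs {"banana", "durian", "apple"} and words {"apple", "banana", "cherry"} should give true, false, true. Building the set is a one-time cost, so it pays off as soon as more than a handful of lookups are made.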

You could put the list of words in a std::unordered_set. Then, for each element in the vector, you just have to test whether it is in the unordered_set, which takes O(1) on average. Overall you would have an expected complexity of O(n), where n is the number of elements in the vector. It is only expected, not guaranteed, because hash collisions can make an individual lookup linear in the size of the set in the worst case.
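The hash-table variant could look roughly like this. It is a sketch that requires C++11 or later; filter_known and its parameter names are invented for the example.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Returns the input strings that occur in `words`, in their original order.
// `words` stands in for the 5000-word list (hypothetical parameter name).
std::vector<std::string> filter_known(const std::vector<std::string>& inputs,
                                      const std::vector<std::string>& words) {
    std::unordered_set<std::string> dictionary;
    dictionary.reserve(words.size());  // avoid rehashing while filling the table
    dictionary.insert(words.begin(), words.end());

    std::vector<std::string> known;
    for (const std::string& s : inputs) {
        // Average-case constant-time lookup; the worst case is linear
        // in the size of the set if many keys collide.
        if (dictionary.find(s) != dictionary.end()) {
            known.push_back(s);
        }
    }
    return known;
}
```

With inputs {"banana", "durian", "apple", "banana"} and words {"apple", "banana", "cherry"}, the result should be {"banana", "apple", "banana"}: duplicates in the input are kept, because every element is tested independently.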

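You could sort the vector, and then you can solve this with a single loop (provided your dictionary is sorted as well), which means O(n) not counting the cost of sorting.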
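One way to read the single-loop idea is a merge-style walk over both sorted sequences. The sketch below sorts both containers itself so that it is self-contained; match_sorted is a made-up name, and if the dictionary is already sorted the second sort can be dropped.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns the input strings that also appear in the dictionary, in sorted order.
// Both vectors are taken by value so the caller's data is left untouched.
std::vector<std::string> match_sorted(std::vector<std::string> inputs,
                                      std::vector<std::string> dictionary) {
    std::sort(inputs.begin(), inputs.end());
    std::sort(dictionary.begin(), dictionary.end());

    std::vector<std::string> matches;
    std::size_t i = 0;  // position in inputs
    std::size_t j = 0;  // position in dictionary
    while (i < inputs.size() && j < dictionary.size()) {
        if (inputs[i] < dictionary[j]) {
            ++i;  // inputs[i] cannot appear later in the dictionary
        } else if (dictionary[j] < inputs[i]) {
            ++j;  // dictionary[j] is smaller than every remaining input
        } else {
            matches.push_back(inputs[i]);
            ++i;  // keep j so a repeated input matches the same entry again
        }
    }
    return matches;
}
```

Each step advances one of the two indices, so the loop runs at most inputs.size() + dictionary.size() times. The sorting dominates the total cost.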

So you have a vector of strings, with each string having one or more words, and you have a vector that's a dictionary, and you're supposed to determine which words in the vector of strings are also in the dictionary? The vector of strings is an annoyance, since you need to look at each word. I'd start by creating a new vector, splitting each string into words, and pushing each word into the new vector. Then sort the new vector and run it through the std::unique algorithm to eliminate duplicates. Then sort the dictionary. Then run both ranges through std::set_intersection to write the result.
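The steps above can be sketched as follows. The sketch assumes that words are separated by whitespace and ignores punctuation and letter case; words_in_dictionary is a hypothetical name.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Returns the distinct words from `lines` that also occur in `dictionary`,
// in sorted order. The dictionary is taken by value so it can be sorted here.
std::vector<std::string> words_in_dictionary(const std::vector<std::string>& lines,
                                             std::vector<std::string> dictionary) {
    // Split every string into whitespace-separated words.
    std::vector<std::string> words;
    for (const std::string& line : lines) {
        std::istringstream stream(line);
        std::string word;
        while (stream >> word) {
            words.push_back(word);
        }
    }

    // Sort, then drop duplicates. std::unique only moves the unique elements
    // to the front, so the leftover tail has to be erased explicitly.
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());

    std::sort(dictionary.begin(), dictionary.end());

    // Both ranges are sorted now, as std::set_intersection requires.
    std::vector<std::string> result;
    std::set_intersection(words.begin(), words.end(),
                          dictionary.begin(), dictionary.end(),
                          std::back_inserter(result));
    return result;
}
```

For the lines {"the quick fox", "the lazy dog"} and the dictionary {"dog", "the", "cat"}, this should return {"dog", "the"}.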

