
Fastest way to search for a string

I have 300 strings to store and search, and most of them are nearly identical in characters and length. For example, I have the strings "ABC1", "ABC2", "ABC3" and so on, and another set like sample1, sample2, sample3. So I am a bit confused about how to store them: should I use an array or a hash table? My main concern is the time it takes to find a string when I need to get one out of storage. If I use an array, I will have to do a string comparison against each index until I arrive at the right one. If I implement a hash table, I will have to take care of collisions (obviously), and I will have to implement chaining to store strings that collide.

So I am looking for suggestions that weigh the pros and cons of each approach and point to the best practice.

Because the keys are short and tend to share a common prefix, you should consider radix data structures such as the Patricia trie and the ternary search tree (search for these and you will find lots of examples). Search time for these structures tends to be O(1) with respect to the number of entries and O(n) with respect to the length of the key. Beware, however, that long strings can use a lot of memory.
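To make the idea concrete, here is a minimal sketch of a ternary search tree, one of the structures named above. It is an illustration only: the class name `TernarySearchTree` and its `insert`/`find` members are made up for this example, it assumes C++17, and it skips balancing and deletion. Each lookup step compares a single character, so keys such as "ABC1" and "ABC2" share the nodes for their common prefix.

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical minimal ternary search tree mapping string keys to string values.
class TernarySearchTree {
    struct Node {
        char ch;                            // character stored at this node
        std::unique_ptr<Node> lo, eq, hi;   // smaller char, next char, larger char
        std::optional<std::string> value;   // set only where a key ends
        explicit Node(char c) : ch(c) {}
    };
    std::unique_ptr<Node> root_;

public:
    // Inserts or overwrites a key. Empty keys are ignored in this sketch.
    void insert(std::string_view key, std::string value) {
        if (key.empty()) return;
        std::unique_ptr<Node>* link = &root_;
        std::size_t i = 0;
        while (true) {
            if (!*link) *link = std::make_unique<Node>(key[i]);
            Node& n = **link;
            if (key[i] < n.ch) {
                link = &n.lo;
            } else if (key[i] > n.ch) {
                link = &n.hi;
            } else if (i + 1 < key.size()) {
                link = &n.eq;   // character matched, move to the next one
                ++i;
            } else {
                n.value = std::move(value);   // last character: store the value here
                return;
            }
        }
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const std::string* find(std::string_view key) const {
        if (key.empty()) return nullptr;
        const Node* n = root_.get();
        std::size_t i = 0;
        while (n) {
            if (key[i] < n->ch) {
                n = n->lo.get();
            } else if (key[i] > n->ch) {
                n = n->hi.get();
            } else if (i + 1 < key.size()) {
                n = n->eq.get();
                ++i;
            } else {
                return n->value ? &*n->value : nullptr;
            }
        }
        return nullptr;
    }
};
```

With this sketch, inserting "ABC1" and "ABC2" creates the nodes for 'A', 'B' and 'C' once, and the two keys differ only in the last node. Note that `find("ABC")` returns nullptr, because a prefix of a stored key is not itself a key.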

Search time is similar to that of a hash map if you ignore collision resolution, which is not an issue in a radix search. Note that I count the time to compute the hash as part of the cost of a hash map lookup. People tend to forget it.
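For comparison, the hash map side can be sketched with the standard `std::unordered_map`, which handles collisions internally, so no hand-written chaining is needed. The helper names `make_table` and `lookup` are invented for this example. As noted above, every lookup first hashes the whole key, which costs time proportional to the key length.

```cpp
#include <string>
#include <unordered_map>

using Table = std::unordered_map<std::string, std::string>;

// Builds a small table once; the container resolves collisions on its own.
Table make_table() {
    return {{"ABC1", "sample1"}, {"ABC2", "sample2"}, {"ABC3", "sample3"}};
}

// Hashes the whole key, then compares it against the entries in one bucket.
// Returns nullptr when the key is absent.
const std::string* lookup(const Table& table, const std::string& key) {
    const auto it = table.find(key);
    return it == table.end() ? nullptr : &it->second;
}
```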

One downside is that radix structures are not cache-friendly if your keys tend to show up in random order. As someone mentioned, if search time is really important, measure the performance of several alternative approaches.

This depends on how much your data changes. By that I mean: if you have 300 index strings that each refer to another string, how often do those 300 index strings change?

You can use a std::map for quick lookups, but the map requires more resources when it is first created (compared to an array, vector or list).

I use maps mostly for dynamic lookup tables of some kind (for example, IP to socket).

So in your case it would look like this:

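#include <map>
#include <string>

int main() {
    std::map<std::string, std::string> my_map;
    my_map["ABC1"] = "sample1";
    my_map["ABC2"] = "sample2";

    // operator[] inserts an empty value if the key is absent;
    // use my_map.find() when a lookup must not modify the map.
    std::string looked_up = my_map["ABC1"];
}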
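If the 300 index strings rarely or never change, which is the case the question above about update frequency is probing, a sorted vector with binary search is a lighter alternative to the map. The sketch below is one possible illustration, not part of the original answer: `Entry`, `sort_by_key` and `lookup_sorted` are hypothetical names, and it assumes the table is sorted once after loading and then only read.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // (index string, referenced string)

// Sorts the table by key; call once after loading the data.
void sort_by_key(std::vector<Entry>& table) {
    std::sort(table.begin(), table.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
}

// Binary search over a table already sorted by key.
// Returns nullptr when the key is absent.
const std::string* lookup_sorted(const std::vector<Entry>& table, const std::string& key) {
    const auto it = std::lower_bound(
        table.begin(), table.end(), key,
        [](const Entry& e, const std::string& k) { return e.first < k; });
    return (it != table.end() && it->first == key) ? &it->second : nullptr;
}
```

The entries sit in one contiguous array, so a lookup needs about log2(300), roughly 9, string comparisons and no per-node allocations. The price is that inserting a new key later means shifting elements or sorting again.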


 