Best data structure to store and search phrases in C++

I use a trie data structure to store words. Now I have a requirement: given a paragraph, find whether certain phrases are present in that paragraph.

What would be the most efficient way of doing this? The total number of phrases will not be more than 100.

If I were you, I would just throw something together using boost::multi_index_container first, because if you get even more requirements later it will be quite easy to extend. If you later measure and find that it is not performing adequately, you can replace it with an optimized data structure.
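As a point of comparison for the "measure first" advice, here is a minimal baseline sketch. It deliberately swaps in the standard library rather than boost::multi_index_container, and the function name `find_phrases` is a hypothetical one chosen for illustration. With at most 100 phrases, a plain substring scan per phrase may already be fast enough, and it gives you something to measure against.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the phrases that occur verbatim in the paragraph.
// Cost is roughly (number of phrases) x (one substring search each), which is
// often acceptable for about 100 phrases and a single paragraph.
std::vector<std::string> find_phrases(const std::string& paragraph,
                                      const std::vector<std::string>& phrases) {
    std::vector<std::string> found;
    for (const std::string& phrase : phrases) {
        // Skip empty phrases: find("") would match every paragraph.
        if (!phrase.empty() && paragraph.find(phrase) != std::string::npos) {
            found.push_back(phrase);
        }
    }
    return found;
}
```

The match is case-sensitive and ignores word boundaries; whether that is acceptable depends on requirements the question does not state.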

The trie specified is suboptimal in numerous ways.

  • For a start, it constructs multiple nodes per item inserted. As the author writes, "Every character of input key is inserted as an individual trie node." That's a horrible and unnecessary penalty! The use of an ALPHABET_SIZE greater than 2 adds insult to injury here; not only would a phrase of fifty bytes require fifty nodes, but each node would likely be over one hundred bytes in size... Each item or "phrase" of fifty bytes in length might require up to about 5KB of storage using that code! That's not even the worst of it.
  • The algorithm provided embeds malloc internally, making it quite difficult to optimise. Each node is its own allocation, making insert very malloc-heavy. Allocation details should be separated from data structure processing, if not for the purpose of optimisation then for simplicity of use. Programs that use this code heavily are likely to run into performance issues related to memory fragmentation and/or cache misses, with no easy or significant optimisation in sight except for replacing the trie with something else.
  • That's not the only thing wrong here... This code isn't too portable, either! If you end up running this on an old (not that old; they do still exist!) mainframe that uses EBCDIC rather than ASCII, this code will produce buffer overflows, and the programmer (you) will be called in to fix it. <sarcasm> That's so optimal, right? </sarcasm>
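To make the size and portability points concrete, here is a rough sketch. The criticised code is not reproduced in this thread, so the node layout below is an assumption: one child pointer per lowercase letter plus an end-of-word flag, which is the usual shape of such tutorials. The names `worst_case_bytes` and `child_index` are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Assumed layout, not the original code: one child slot per lowercase letter.
constexpr std::size_t ALPHABET_SIZE = 26;

struct TrieNode {
    TrieNode* children[ALPHABET_SIZE];
    bool is_end_of_word;
};

// Worst case for one key that shares no prefix with others:
// one node per character of the key.
constexpr std::size_t worst_case_bytes(std::size_t key_length) {
    return key_length * sizeof(TrieNode);
}

// Portable child index. Computing c - 'a' assumes the letters have contiguous
// codes, which holds in ASCII but not in EBCDIC, where the result can exceed
// ALPHABET_SIZE - 1 and index past the end of the children array.
// Searching a literal alphabet avoids that assumption.
// Returns -1 for any character outside the alphabet so callers can reject it.
inline int child_index(char c) {
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    if (c == '\0') return -1;  // strchr would otherwise match the terminator
    const char* p = std::strchr(alphabet, c);
    return p ? static_cast<int>(p - alphabet) : -1;
}
```

With 4-byte pointers this node is a little over 100 bytes, so fifty nodes come to roughly the 5KB mentioned above; with 8-byte pointers the figure roughly doubles.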

I've written a PATRICIA trie implementation that uses exactly one node per item, an alphabet size of two (it uses the bits of each character, rather than each character) and allows you to use whichever allocation you wish... alas, I haven't yet put a lot of effort into refactoring its interface, but it should be fairly close to optimal. You can find that implementation here. You can see examples of inserting (using patricia_add), retrieving (using patricia_get) and removing (using patricia_remove) in the patricia_test.c testcase file.
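The "alphabet size of two" idea can be illustrated without the linked implementation, whose interface is not shown here. The sketch below is not that library's code; `key_bit` and `first_differing_bit` are hypothetical helpers showing the two bit-level operations a PATRICIA-style trie typically relies on: reading a single bit of a key, and finding the first bit at which two keys diverge, which is the position such a trie would branch on.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: bit number i of the key, counting from the most
// significant bit of the first byte. Bits past the end of the key read as 0.
inline int key_bit(const std::string& key, std::size_t i) {
    const std::size_t byte = i / CHAR_BIT;
    if (byte >= key.size()) return 0;
    const unsigned char c = static_cast<unsigned char>(key[byte]);
    return (c >> (CHAR_BIT - 1 - i % CHAR_BIT)) & 1;
}

// Hypothetical helper: index of the first bit at which the two keys differ.
// Returns the bit length of the longer key if no difference is found.
// Because missing bits read as 0, keys that differ only by trailing NUL
// bytes are treated as equal; ordinary text phrases do not hit that case.
inline std::size_t first_differing_bit(const std::string& a, const std::string& b) {
    const std::size_t total = std::max(a.size(), b.size()) * CHAR_BIT;
    for (std::size_t i = 0; i < total; ++i) {
        if (key_bit(a, i) != key_bit(b, i)) return i;
    }
    return total;
}
```

Each branch decision then looks at one bit, so a node needs only two child links regardless of the character set.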
