
File backed Trie (or Prefix Tree) implementation

Original post: 2009-11-06 06:41:12. Tags: c++, boost, data-structures, trie

I have to store a lot of strings in a C++ map to keep only the unique ones, and whenever a duplicate string occurs I just need to increment its counter (pair.second). I've used a C++ map and it fits this situation well. Since the file being processed has now grown to 30 GB, I am trying to keep this structure in a file instead of in memory.
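For reference, the in-memory approach described above might look like the following minimal sketch. It assumes one string per line; the function name `count_lines` is made up for illustration and is not from the original post.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>  // std::istringstream, handy for trying this on in-memory input
#include <string>

// Count how often each line occurs in `in`.
// The map key is the unique string; the mapped value (pair.second) is its counter.
std::map<std::string, std::size_t> count_lines(std::istream& in) {
    std::map<std::string, std::size_t> counts;
    std::string line;
    while (std::getline(in, line)) {
        // operator[] value-initializes a missing counter to 0 before the increment.
        ++counts[line];
    }
    return counts;
}
```

Calling `count_lines` on a `std::istringstream` holding `"apple\nbanana\napple\n"` yields two entries, with `apple` counted twice. Every unique string lives in memory here, which is exactly what stops scaling with a 30 GB input.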

I also came across the trie, which is faster than a map in this case. Is anyone aware of a file-backed trie implementation? I came across a trie implementation similar to what I am looking for, but it does not seem to be bug free.

2 Solutions

How are you going to load 30 GB into memory all at once? And since it is dictionary-like behavior you want, I'd imagine that every time you insert or increment, you'll need to load the whole file (even if piece by piece) for the lookup.

I suggest using a database. That is what they're for...

If you can sort the file containing the strings, then reading the sorted list and counting duplicates is easy. (You can retain the original file and create a new file of sorted strings.) Sorting large files efficiently is old technology. You should be able to find a utility for that.
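Once the file is sorted (for example by an external utility such as the Unix `sort` command), the counting pass only has to remember the previous line. A minimal sketch, assuming one string per line; the name `count_sorted` and the `count<TAB>string` output format are choices made for this illustration:

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>  // std::istringstream / std::ostringstream for in-memory trials
#include <string>

// Read already-sorted lines from `in` and write one "count<TAB>string" line
// to `out` for each run of identical lines. Memory use is constant apart
// from the two line buffers.
void count_sorted(std::istream& in, std::ostream& out) {
    std::string prev, line;
    std::size_t run = 0;  // length of the current run of equal lines
    while (std::getline(in, line)) {
        if (run > 0 && line == prev) {
            ++run;  // same string as before: extend the run
            continue;
        }
        if (run > 0) out << run << '\t' << prev << '\n';  // flush the finished run
        prev = line;
        run = 1;
    }
    if (run > 0) out << run << '\t' << prev << '\n';  // flush the last run
}
```

Because duplicates are adjacent after sorting, no map or tree is needed at all, so the pass works on inputs far larger than memory.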

If you can't sort, then consider digesting the strings. MD5 may be overkill for your purpose; you can cobble something together. For billions of strings, you could use 8-byte digests. Use a tree (probably a BST) of digests. For each digest, store the file offsets of the unique strings that produce that digest.

When you read a string, compute its digest and look it up. If you don't find the digest, you know the string is unique, so store it in the tree. If you do find the digest, check each associated string for a match and handle it accordingly.

To compare strings, you will need to go to the file, since all you've stored is the file offsets.

What's important to remember is that if two digests are different, the strings that produced them must be different. If the digests are the same, the strings may still differ, so you need to check. This algorithm will be more efficient when there are fewer duplicate strings.
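The digest scheme described above can be sketched roughly as follows. Several things here are assumptions rather than part of the answer: 64-bit FNV-1a stands in for the "cobbled together" 8-byte digest, `std::map` (a balanced tree) plays the role of the BST, strings are newline-delimited, and the input is a seekable stream opened in binary mode so that offsets can be computed by counting bytes. The names `fnv1a64`, `Entry`, `DigestIndex`, and `count_by_digest` are hypothetical.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>  // std::istringstream for in-memory trials
#include <string>
#include <vector>

// 64-bit FNV-1a, a simple non-cryptographic hash used here as the 8-byte digest.
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;                // FNV prime
    }
    return h;
}

struct Entry {
    std::streamoff offset;  // where the first occurrence of the string starts
    std::uint64_t count;    // how many times the string has been seen
};

// digest -> all unique strings (by file offset) that produce that digest
using DigestIndex = std::map<std::uint64_t, std::vector<Entry>>;

DigestIndex count_by_digest(std::istream& in) {
    DigestIndex index;
    std::string line, stored;
    std::streamoff pos = 0;  // offset of the next unread line
    while (std::getline(in, line)) {
        const std::streamoff start = pos;
        pos += static_cast<std::streamoff>(line.size()) + 1;  // +1 for '\n'
        std::vector<Entry>& bucket = index[fnv1a64(line)];
        bool found = false;
        for (Entry& e : bucket) {
            // Same digest: go back to the file to compare the actual strings.
            in.clear();
            in.seekg(e.offset);
            std::getline(in, stored);
            if (stored == line) { ++e.count; found = true; break; }
        }
        if (!bucket.empty()) {  // we moved the read position: resume after `line`
            in.clear();
            in.seekg(pos);
        }
        if (!found) bucket.push_back({start, 1});  // new unique string
    }
    return index;
}
```

Only digests, offsets, and counters stay in memory; the strings themselves remain in the file. The price is a seek and a re-read for every digest hit, so each duplicate costs extra file access while a string with a new digest costs none.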