Search Large Text File for Thousands of strings

I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.

I have a list of 20,000 unique strings. I want to know the offset for each string each time it appears in the file. Currently, my output looks like this:

netloader.cc found at offset: 46350917
netloader.cc found at offset: 48138591
netloader.cc found at offset: 50012089
netloader.cc found at offset: 51622874
netloader.cc found at offset: 52588949
...
360doc.com found at offset: 26411474
360doc.com found at offset: 26411508
360doc.com found at offset: 26483662
360doc.com found at offset: 26582000

I am loading the 20,000 strings into a std::set (to ensure uniqueness), then reading a 128MB chunk from the file, and then using string::find to search for the strings (start over by reading another 128MB chunk). This works and completes in about 4 days. I'm not concerned about a read boundary potentially breaking a string I'm searching for. If it does, that's OK.
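For reference, a minimal sketch of this kind of chunked scan might look like the following; the file name, chunk size constant, and needle list are placeholders:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <string_view>

int main() {
    // Placeholder needle list; in practice this holds the 20,000 strings.
    std::set<std::string> needles = {"netloader.cc", "360doc.com"};

    const std::size_t kChunkSize = 128 * 1024 * 1024;     // 128 MB per read
    std::ifstream file("huge.txt", std::ios::binary);
    std::string chunk(kChunkSize, '\0');
    std::uint64_t base = 0;                                // file offset of the current chunk

    while (file.read(&chunk[0], kChunkSize) || file.gcount() > 0) {
        std::string_view view(chunk.data(), static_cast<std::size_t>(file.gcount()));
        for (const auto& needle : needles) {               // one full pass per needle, per chunk
            for (std::size_t pos = view.find(needle); pos != std::string_view::npos;
                 pos = view.find(needle, pos + 1)) {
                std::cout << needle << " found at offset: " << (base + pos) << '\n';
            }
        }
        base += view.size();
    }
}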

I'd like to make it faster. Completing the search in 1 day would be ideal, but any significant performance improvement would be nice. I prefer to use standard C++ with Boost (if necessary) while avoiding other libraries.

So I have two questions:

  1. Does the 4 day time seem reasonable considering the tools I'm using and the task?
  2. What's the best approach to make it faster?

Thanks.

Edit: Using the Trie solution, I was able to shorten the run-time to 27 hours. Not within one day, but certainly much faster now. Thanks for the advice.

The problem you describe looks more like a problem with the selected algorithm, not with the technology of choice. 20000 full scans of 20GB in 4 days doesn't sound too unreasonable, but your target should be a single scan of the 20GB and another single scan of the 20K words.

Have you considered looking at some string matching algorithms? Aho–Corasick comes to mind.
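For illustration, a compact Aho–Corasick sketch in C++ could look roughly like this; the structure and member names are my own, not taken from any particular library:

#include <algorithm>
#include <queue>
#include <string>
#include <vector>

// Sketch of an Aho-Corasick automaton over raw bytes (names are illustrative).
struct AhoCorasick {
    struct Node {
        int next[256];                 // goto transitions, resolved fully in build()
        int fail = 0;                  // failure link
        std::vector<int> output;       // indices of patterns that end at this node
        Node() { std::fill(next, next + 256, -1); }
    };
    std::vector<Node> nodes;
    AhoCorasick() { nodes.emplace_back(); }   // node 0 is the root

    void add_pattern(const std::string& p, int id) {
        int cur = 0;
        for (unsigned char c : p) {
            if (nodes[cur].next[c] < 0) {
                nodes[cur].next[c] = static_cast<int>(nodes.size());
                nodes.emplace_back();
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].output.push_back(id);
    }

    // Build failure links with a breadth-first pass over the trie.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            int v = nodes[0].next[c];
            if (v < 0) nodes[0].next[c] = 0;
            else { nodes[v].fail = 0; q.push(v); }
        }
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int c = 0; c < 256; ++c) {
                int v = nodes[u].next[c];
                if (v < 0) { nodes[u].next[c] = nodes[nodes[u].fail].next[c]; continue; }
                nodes[v].fail = nodes[nodes[u].fail].next[c];
                const auto& inherited = nodes[nodes[v].fail].output;
                nodes[v].output.insert(nodes[v].output.end(), inherited.begin(), inherited.end());
                q.push(v);
            }
        }
    }

    // Advance by one character; returns the indices of all patterns ending here.
    const std::vector<int>& step(int& state, unsigned char c) const {
        state = nodes[state].next[c];
        return nodes[state].output;
    }
};

Feeding the file through step one byte at a time gives a single pass over the 20 GB input; when a pattern of length L is reported after consuming the byte at offset i, its starting offset is i - L + 1.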

Algorithmically, I think the best way to approach this problem would be to use a tree to store the lines you want to search for, one character at a time. For example, if you have the following patterns you would like to look for:

hand, has, have, foot, file

The resulting tree would look something like this: [image: tree generated from the list of search terms]

The generation of the tree is worst case O(n), and has a sub-linear memory footprint generally.

Using this structure, you can begin processing your file by reading in a character at a time from your huge file, and walking the tree.

  • If you get to a leaf node (the ones shown in red), you have found a match, and can store it.
  • If there is no child node corresponding to the letter you have read, you can discard the current line and begin checking the next line, starting from the root of the tree.

This technique would result in linear time O(n) to check for matches and scan the huge 20 GB file only once.

Edit

The algorithm described above is certainly sound (it doesn't give false positives) but not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode for the complete version of the algorithm:

tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
# Keeps track of where we currently are in the tree
nodes = []
for character in huge_file:
  next_nodes = []
  foreach node in nodes:
    if node.has_child(character):
      child = node.get_child(character)
      if child.is_leaf():
        # You found a match!!
      next_nodes.add(child)
  if tree.has_child(character):
    next_nodes.add(tree.get_child(character))
  nodes = next_nodes

Note that the list of nodes that has to be checked each time is at most as long as the longest word that has to be checked against. Therefore it should not add much complexity.
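In C++ (the language the question is using), the walk described above might look roughly like this sketch; the trie layout and names are my own, and standard input stands in for the 20 GB file:

#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// A small sketch of the trie walk described above (names are illustrative).
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool is_leaf = false;          // a search term ends here
    std::string term;              // the term itself, for reporting
};

void insert(TrieNode& root, const std::string& term) {
    TrieNode* cur = &root;
    for (char c : term) {
        auto& child = cur->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        cur = child.get();
    }
    cur->is_leaf = true;
    cur->term = term;
}

int main() {
    TrieNode root;
    for (const auto& t : {"hand", "has", "have", "foot", "file"}) insert(root, t);

    std::vector<TrieNode*> active;               // one cursor per partial match in flight
    std::uint64_t offset = 0;
    char c;
    while (std::cin.get(c)) {                    // stand-in for streaming the 20 GB file
        std::vector<TrieNode*> next;
        active.push_back(&root);                 // a new match may start at this character
        for (TrieNode* node : active) {
            auto it = node->children.find(c);
            if (it == node->children.end()) continue;
            TrieNode* child = it->second.get();
            if (child->is_leaf)
                std::cout << child->term << " found at offset: "
                          << offset - child->term.size() + 1 << '\n';
            next.push_back(child);
        }
        active = std::move(next);
        ++offset;
    }
}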

Rather than searching separately for each of the 20,000 strings, you can try to tokenize the input and look each token up in your std::set of strings to be found; this will be much faster. This assumes your strings are simple identifiers, but something similar can be implemented for strings that are sentences. In that case you would keep a set of the first word of each sentence and, after a successful match, verify with string::find that it really is the beginning of the whole sentence.
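A minimal sketch of that idea, assuming whitespace-delimited tokens and placeholder file and needle names:

#include <cctype>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {
    // Placeholder needle set; in practice this holds the 20,000 strings.
    std::set<std::string> needles = {"netloader.cc", "360doc.com"};

    std::ifstream file("huge.txt", std::ios::binary);
    std::string token;
    std::uint64_t token_start = 0;     // offset of the first character of the current token
    std::uint64_t pos = 0;             // current byte offset in the file
    char c;
    while (file.get(c)) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            if (!token.empty() && needles.count(token))
                std::cout << token << " found at offset: " << token_start << '\n';
            token.clear();
        } else {
            if (token.empty()) token_start = pos;
            token += c;
        }
        ++pos;
    }
    if (!token.empty() && needles.count(token))   // flush the last token at end of file
        std::cout << token << " found at offset: " << token_start << '\n';
}

This replaces the 20,000 string::find passes per chunk with a single set lookup per token, so the file is scanned only once.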
