Search many strings over a very large text

Original 2014-02-18 06:21:31 6 2 c++/ string/ search/ trie/ large-text

I have about 2 million strings, and I need to search for each of them in 1 TB of text data. Searching for them one at a time is not a good solution, so I was thinking about building a data structure such as a trie over all of the strings; in other words, a trie in which each node is a word. Is there a good algorithm, data structure, or library (in C++) for this purpose?

Let me be more specific.

For instance, I have these strings:
s1: "I love you"
s2: "How are you"
s3: "What's up dude"

And I have many texts, like:
t1: "Hi, my name is Omid and I love computers. How are you guys?"
t2: "Your every wish will be done, they tell me..."
t3, t4, ..., t10000

Then I want to go through each of the texts and search for each of the strings in it. For this example, the result would just be: t1 contains s2 and nothing else. I am looking for an efficient way to search for the strings, rather than naively scanning for each of them separately every time.

2 answers

I'm sorry to post a link-only answer, but if you don't mind reading research papers, the definitive reference on string matching algorithms seems to me to be http://www-igm.univ-mlv.fr/~lecroq/string/ and the research paper by Simone Faro and Thierry Lecroq, in which they compared the relative performance of no fewer than 85 different string matching algorithms. I'm pretty sure there is one among them that fits your need.
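To make the "trie over all of the strings" idea concrete, here is a minimal sketch of one well-known multi-pattern matcher, the Aho-Corasick automaton. The answer above does not name a specific algorithm, so treat this choice as an assumption rather than a recommendation from that reference. The class name MultiMatcher and its methods are hypothetical, and the per-node hash maps are chosen for brevity, not for memory efficiency with 2 million patterns.

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of an Aho-Corasick automaton: a trie of all patterns
// plus failure links, so a text is scanned once for all patterns together.
class MultiMatcher {
public:
    MultiMatcher() : nodes_(1) {}  // node 0 is the root

    // Adds a non-empty pattern and returns its id. Call before build().
    int add(const std::string& pattern) {
        assert(!built_ && !pattern.empty());
        int node = 0;
        for (unsigned char c : pattern) {
            auto it = nodes_[node].next.find(c);
            if (it != nodes_[node].next.end()) {
                node = it->second;
            } else {
                int child = static_cast<int>(nodes_.size());
                nodes_.emplace_back();
                nodes_[node].next[c] = child;
                node = child;
            }
        }
        if (nodes_[node].pattern_id < 0) nodes_[node].pattern_id = pattern_count_++;
        return nodes_[node].pattern_id;  // duplicates share one id
    }

    // Computes failure and output links breadth-first.
    void build() {
        std::queue<int> q;
        for (const auto& edge : nodes_[0].next) q.push(edge.second);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (const auto& edge : nodes_[u].next) {
                unsigned char c = edge.first;
                int v = edge.second;
                // Follow failure links of the parent until one has a c-edge.
                int f = nodes_[u].fail;
                while (f != 0 && nodes_[f].next.count(c) == 0) f = nodes_[f].fail;
                auto it = nodes_[f].next.find(c);
                int fl = (it != nodes_[f].next.end()) ? it->second : 0;
                nodes_[v].fail = fl;
                // Nearest node on the failure chain that ends a pattern.
                nodes_[v].out_link = (nodes_[fl].pattern_id >= 0) ? fl : nodes_[fl].out_link;
                q.push(v);
            }
        }
        built_ = true;
    }

    // Returns the ids of all patterns that occur somewhere in the text.
    std::set<int> find_all(const std::string& text) const {
        assert(built_);
        std::set<int> found;
        int node = 0;
        for (unsigned char c : text) {
            while (node != 0 && nodes_[node].next.count(c) == 0) node = nodes_[node].fail;
            auto it = nodes_[node].next.find(c);
            if (it != nodes_[node].next.end()) node = it->second;
            int n = (nodes_[node].pattern_id >= 0) ? node : nodes_[node].out_link;
            for (; n != -1; n = nodes_[n].out_link) found.insert(nodes_[n].pattern_id);
        }
        return found;
    }

private:
    struct Node {
        std::unordered_map<unsigned char, int> next;  // outgoing trie edges
        int fail = 0;         // longest proper suffix that is also a trie path
        int out_link = -1;    // next pattern-ending node on the failure chain
        int pattern_id = -1;  // id of the pattern ending here, or -1
    };
    std::vector<Node> nodes_;
    int pattern_count_ = 0;
    bool built_ = false;
};
```

Usage would be: add() every search string once, call build(), then call find_all() on each text. With the sample strings and t1 from the question, exact matching reports only the id of "How are you". For text that does not fit in memory, the current node would have to be carried over between chunks, which this sketch leaves out.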

I would strongly suggest that you use CLucene ( http://clucene.sourceforge.net/ ), which is a port of the Apache Lucene project. It will build an inverted index for you and make text searching very fast. If changing languages is an option, consider doing this in Java, as the CLucene version is a bit out of date. The Java version will be slower but has more features.
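For readers unfamiliar with the term, here is a toy illustration of what an inverted index does. This is not CLucene's API: TinyIndex, tokenize, candidates and find_phrase are hypothetical names, and keeping the full documents in memory is an assumption that only works for small inputs. The index maps each word to the set of documents containing it, so a phrase only needs to be verified against the few documents that contain all of its words.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Lower-cases the text and splits it on non-alphanumeric characters.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string cur;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            cur.push_back(static_cast<char>(std::tolower(c)));
        } else if (!cur.empty()) {
            tokens.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

// Hypothetical toy inverted index: word -> sorted set of document ids.
class TinyIndex {
public:
    // Stores the document and indexes its words; returns the document id.
    int add_document(const std::string& text) {
        int id = static_cast<int>(docs_.size());
        docs_.push_back(text);
        for (const std::string& tok : tokenize(text)) postings_[tok].insert(id);
        return id;
    }

    // Ids of documents that contain every word of the query, in any order.
    std::vector<int> candidates(const std::string& query) const {
        std::vector<std::string> words = tokenize(query);
        if (words.empty()) return {};
        std::vector<int> result;
        bool first = true;
        for (const std::string& w : words) {
            auto it = postings_.find(w);
            if (it == postings_.end()) return {};  // a word appears nowhere
            if (first) {
                result.assign(it->second.begin(), it->second.end());
                first = false;
                continue;
            }
            std::vector<int> merged;
            std::set_intersection(result.begin(), result.end(),
                                  it->second.begin(), it->second.end(),
                                  std::back_inserter(merged));
            result.swap(merged);
        }
        return result;
    }

    // Candidates that also contain the phrase as an exact substring.
    std::vector<int> find_phrase(const std::string& phrase) const {
        std::vector<int> hits;
        for (int id : candidates(phrase)) {
            if (docs_[id].find(phrase) != std::string::npos) hits.push_back(id);
        }
        return hits;
    }

private:
    std::vector<std::string> docs_;                  // assumption: fits in memory
    std::map<std::string, std::set<int>> postings_;  // the inverted index
};
```

With the question's t1 and t2 indexed, candidates("I love you") would return t1 because all three words occur in it, while find_phrase("I love you") would reject it after the substring check. A real search library typically stores word positions in the index so that phrases can be verified without rereading the documents.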