
Efficient String text search

I'd like to create a method that searches a short String (usually no more than 256 characters) for any of about 20 different words. If it finds one of them in the text, regardless of case, it returns true.

The method will be executed quite a bit (not a crazy amount), so it has to be as efficient as possible. What would you suggest is best here?

The 20 words do not change. They are static. But the text to scan does.

I'd suggest: add all the words in the input text to a Set. It's only 256 characters after all, and adding them is an O(n) operation.

After that you can test each of the 20 or so words for membership using the contains() operation of the Set, which is O(1).
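A minimal sketch of the Set idea, assuming words are separated by anything that is not a letter or digit. The class name `TokenSetSearch` and the three target words are placeholders, since the real 20 words are not given in the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TokenSetSearch {
    // Hypothetical targets (assumption); stored in lowercase.
    private static final Set<String> TARGETS =
            new HashSet<>(Arrays.asList("alpha", "bravo", "charlie"));

    // Splits the lowercased input into words, then tests each target for membership.
    static boolean containsAnyWord(String input) {
        String[] tokens = input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Set<String> inputWords = new HashSet<>(Arrays.asList(tokens));
        for (String target : TARGETS) {
            if (inputWords.contains(target)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that this matches whole words only: `containsAnyWord("Say BRAVO, please")` is true, while "bravado" does not match "bravo". If substring matches are wanted, this approach does not apply.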

Since the 20 words to search for don't change, one of the fastest ways to look for them is to compile a regular expression that matches them and reuse it on different inputs. The complexity of matching a regular expression against a given string is linear in the string length for simple regular expressions that don't require backtracking. In your case the length is bounded, so it's O(1).
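As a rough sketch of the precompiled-regex approach: the pattern is built once from the word list and reused for every input. `RegexWordSearch` and the target words are illustrative assumptions, and the `\b` word boundaries are a design choice that restricts matches to whole words.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RegexWordSearch {
    // Hypothetical targets (assumption); the real list is not in the question.
    private static final List<String> TARGETS = Arrays.asList("alpha", "bravo", "charlie");

    // Compiled once at class load and shared; Pattern instances are thread-safe.
    private static final Pattern ANY_TARGET = compile(TARGETS);

    // Builds \b(?:word1|word2|...)\b, quoting each word so metacharacters are literal.
    static Pattern compile(List<String> words) {
        StringBuilder alternation = new StringBuilder();
        for (String word : words) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(word));
        }
        return Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    static boolean containsAnyWord(String input) {
        return ANY_TARGET.matcher(input).find();
    }
}
```

Dropping the two `\\b` parts would make it match substrings instead. Keep in mind that `java.util.regex` is a backtracking engine, so the linear bound is not guaranteed for arbitrary patterns, although a plain alternation of literal words is cheap on inputs this short.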

The String class already has lots of methods to do these sorts of things. For example, the indexOf method will solve your problem:

String str = "blahblahtestblah";
int result = str.indexOf("test");

result will contain -1 if the string does not contain the word "test". I'm not sure if this is efficient enough for you, but I would start here as it's already implemented!

Assuming these 20 words are in a Set<String> and all are lowercase, then it is as easy as:

public final boolean containsWord(final String input)
{
    final String s = input.toLowerCase();
    for (final String word: wordSet)
        if (s.indexOf(word) != -1)
            return true;
    return false;
}

If you want to search for a number of different targets simultaneously, then the Rabin-Karp algorithm is a possibility. It is especially efficient if there are only a few different word lengths in your list of 20 targets. A single pass through the string will find all the matches of a given length.
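To make the "one pass per word length" point concrete, here is a hedged sketch of a multi-pattern Rabin-Karp search. The class name `RabinKarpSearch`, the base 257, and the use of int overflow as the modulus are all choices made for this illustration and are not prescribed by the answer.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class RabinKarpSearch {
    private static final int BASE = 257;

    // Targets grouped by length: length -> (hash -> lowercase words with that hash).
    private final Map<Integer, Map<Integer, Set<String>>> byLength = new HashMap<>();

    RabinKarpSearch(List<String> words) {
        for (String word : words) {
            String lower = word.toLowerCase(Locale.ROOT);
            if (lower.isEmpty()) {
                continue;
            }
            byLength.computeIfAbsent(lower.length(), k -> new HashMap<>())
                    .computeIfAbsent(hash(lower, 0, lower.length()), k -> new HashSet<>())
                    .add(lower);
        }
    }

    // Polynomial hash of s[from, from + len); int overflow acts as the modulus.
    private static int hash(String s, int from, int len) {
        int h = 0;
        for (int i = from; i < from + len; i++) {
            h = h * BASE + s.charAt(i);
        }
        return h;
    }

    // Case-insensitive substring search: one rolling-hash pass per distinct length.
    boolean containsAny(String input) {
        String text = input.toLowerCase(Locale.ROOT);
        for (Map.Entry<Integer, Map<Integer, Set<String>>> entry : byLength.entrySet()) {
            int len = entry.getKey();
            if (len > text.length()) {
                continue;
            }
            int high = 1; // BASE^(len - 1), the weight of the outgoing character
            for (int i = 1; i < len; i++) {
                high *= BASE;
            }
            int h = hash(text, 0, len);
            for (int start = 0; ; start++) {
                Set<String> candidates = entry.getValue().get(h);
                // On a hash hit, compare the actual text to rule out collisions.
                if (candidates != null
                        && candidates.contains(text.substring(start, start + len))) {
                    return true;
                }
                if (start + len >= text.length()) {
                    break;
                }
                // Slide the window: drop text[start], append text[start + len].
                h = (h - text.charAt(start) * high) * BASE + text.charAt(start + len);
            }
        }
        return false;
    }
}
```

Build the searcher once from the static word list and call `containsAny` for each input. For 20 words and 256 characters the constant factors may well outweigh the benefit, so it is worth measuring against the plain indexOf loop.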

I'd do the following:

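import java.util.List;

// Returns true if longStr (the string to search in) contains any of the words.
// Note: String.contains is case-sensitive, so lowercase both sides first if needed.
static boolean containsAny(String longStr, List<String> words)
{
    for (String word : words)
    {
        if (longStr.contains(word))
            return true;
    }
    return false;
}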

You can put all the words in a List, sort it and use Collections.binarySearch(...). You will lose time on sorting, but the binarySearch itself is O(log n).
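One possible reading of this suggestion, sketched under the assumption that "all the words" means the words of the input text and that the targets are already lowercase. `SortedListSearch` is a made-up name for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class SortedListSearch {
    // Sorts the words of the input, then binary-searches the list for each target.
    static boolean containsAnyWord(String input, List<String> lowercaseTargets) {
        String[] tokens = input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        List<String> inputWords = new ArrayList<>(Arrays.asList(tokens));
        Collections.sort(inputWords); // O(n log n) cost paid on every call
        for (String target : lowercaseTargets) {
            if (Collections.binarySearch(inputWords, target) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

Since the 20 targets are static, an alternative would be to sort the targets once and binary-search each input word against them, which avoids the per-call sort.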

OK. Thanks, everybody, for answering and commenting. I realise that the question I asked can have broad and varied answers. But this is what I ended up using, because performance was very important and the standard Collections just wouldn't cut the mustard.

I used a "Patricia Trie" structure, which is a very powerful and elegant data structure capable of offering low memory overhead and extremely fast search speeds.

If anyone is interested, there is a video here briefly explaining how a Patricia Trie works. You will realise why it's so performant after watching. There is also a Java implementation of the data structure on github here.
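For readers who want the general idea without the linked material, below is a sketch of a plain, uncompressed trie. It is not a Patricia trie (which additionally merges single-child chains into one edge) and it is not the implementation referred to above; `SimpleTrieSearch` is a name invented for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SimpleTrieSearch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Inserts each target word, lowercased, one character per trie level.
    SimpleTrieSearch(List<String> words) {
        for (String word : words) {
            if (word.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.endOfWord = true;
        }
    }

    // Walks the trie from every start position; stops as soon as a word ends.
    boolean containsAny(String input) {
        String text = input.toLowerCase(Locale.ROOT);
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break; // no target continues with this character
                }
                if (node.endOfWord) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

The trie is built once from the static word list. Each start position is abandoned at the first character that no target shares, which is where the speed comes from; this sketch matches substrings, not only whole words.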
