Efficient way to search for a set of strings in a string in Java

I have a set of about 100-200 elements. Let a sample element be X.

Each element is a set of strings (between 1 and 4 strings per set), e.g. X = { s1, s2, s3 }.

For a given input string P (about 100 characters), I want to test whether any X is present in it.

X is present in P iff every s in X is a substring of P.

The set of elements is available for pre-processing.


I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:

  • Checking whether every string s is a substring of P seems like a costly operation
  • Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
  • I cannot directly use a regex, as s1, s2, s3 can be present in any order and all of the strings need to be present as substrings

Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of its strings. Because the number of strings in X is at most 4, this is still feasible. It would be great if somebody could point me to a better (faster/more elegant) approach.

Please note that the set of elements is available for pre-processing and I want the solution in Java.

You can use regex directly:

Pattern regex = Pattern.compile(
    "^               # Anchor search to start of string\n" +
    "(?=.*s1)        # Check if string contains s1\n" +
    "(?=.*s2)        # Check if string contains s2\n" +
    "(?=.*s3)        # Check if string contains s3", 
    Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
boolean foundMatch = regexMatcher.find();

foundMatch is true if all three substrings are present in the string.

Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
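As a minimal sketch of the escaping step, the lookahead pattern can be built programmatically with `Pattern.quote`, which wraps each needle so it is matched literally. The class name `LookaheadRegex` and the helper names `allOf` and `containsAll` are made up for this illustration and are not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LookaheadRegex {
    // Builds ^(?=.*\Qs1\E)(?=.*\Qs2\E)... so every needle is treated as a literal.
    static Pattern allOf(List<String> needles) {
        StringBuilder sb = new StringBuilder("^");
        for (String needle : needles) {
            sb.append("(?=.*").append(Pattern.quote(needle)).append(")");
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // True if every needle of the compiled element occurs somewhere in p.
    static boolean containsAll(Pattern compiled, String p) {
        return compiled.matcher(p).find();
    }

    public static void main(String[] args) {
        // Both needles contain regex metacharacters ('.', '(' and ')').
        Pattern x = allOf(Arrays.asList("a.b", "(c)"));
        System.out.println(containsAll(x, "xx(c)yy a.b")); // true
        System.out.println(containsAll(x, "xx(c)yy aXb")); // false: '.' is literal
    }
}
```

Since the elements are available for pre-processing, each pattern can be compiled once and reused for every input string.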

It sounds like you're prematurely optimising your code before you've discovered that a particular approach is actually too slow.

The nice property of your set of strings is that P must contain all strings of X as substrings, meaning we can fail fast as soon as we find one string of X that is not contained in P. This might turn out to save more time than other approaches, especially if the strings in X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters of a 100-character string when checking for the presence of a 5-character string with non-repeating characters (e.g. coast). And since there are 100-200 elements to test, you really, really want to fail fast if you can.

My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
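The suggestion might look roughly like the sketch below. The answer does not say which direction to sort in; this version assumes longest first, on the guess that a longer string is less likely to occur and so rejects an element sooner. `FailFastMatcher` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FailFastMatcher {
    private final List<String[]> elements = new ArrayList<>();

    // Preprocessing: copy each element and sort its strings, longest first (an assumption).
    FailFastMatcher(List<List<String>> rawElements) {
        for (List<String> x : rawElements) {
            String[] sorted = x.toArray(new String[0]);
            Arrays.sort(sorted, (a, b) -> Integer.compare(b.length(), a.length()));
            elements.add(sorted);
        }
    }

    // True if at least one element has all of its strings inside p.
    boolean anyPresent(String p) {
        for (String[] x : elements) {
            if (allPresent(x, p)) {
                return true;
            }
        }
        return false;
    }

    private static boolean allPresent(String[] x, String p) {
        for (String s : x) {
            if (!p.contains(s)) {
                return false; // fail fast: one missing string rules out the element
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> raw = Arrays.asList(
                Arrays.asList("sea", "coast"),
                Arrays.asList("hill"));
        FailFastMatcher m = new FailFastMatcher(raw);
        System.out.println(m.anyPresent("the sea coast road")); // true
        System.out.println(m.anyPresent("the coast road"));     // false
    }
}
```

With roughly 200 elements of at most 4 strings each and a 100-character input, this is at most about 800 `contains` calls per input, which is worth measuring before reaching for anything more elaborate.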

Looks like a perfect case for the Rabin–Karp algorithm:

Rabin–Karp is inferior for single pattern searching to the Knuth–Morris–Pratt algorithm, the Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.
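To make the multi-pattern idea concrete, here is a rough sketch, not a tuned implementation. Patterns are grouped by length, and for each length a rolling hash is slid over the text and compared against the hashes of the patterns of that length. The class name `RabinKarpMulti` and the constants `BASE` and `MOD` are choices made for this example. Every hash hit is verified with `startsWith` to rule out collisions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RabinKarpMulti {
    private static final long BASE = 257;
    private static final long MOD = 1_000_000_007L;

    // pattern length -> (hash -> patterns of that length with that hash)
    private final Map<Integer, Map<Long, Set<String>>> byLength = new HashMap<>();

    RabinKarpMulti(Iterable<String> patterns) {
        for (String s : patterns) {
            if (s.isEmpty()) continue;
            byLength.computeIfAbsent(s.length(), k -> new HashMap<>())
                    .computeIfAbsent(hash(s, s.length()), k -> new HashSet<>())
                    .add(s);
        }
    }

    // Polynomial hash of the first len characters of s.
    private static long hash(String s, int len) {
        long h = 0;
        for (int i = 0; i < len; i++) h = (h * BASE + s.charAt(i)) % MOD;
        return h;
    }

    // Returns every pattern that occurs somewhere in text.
    Set<String> findAll(String text) {
        Set<String> found = new HashSet<>();
        for (Map.Entry<Integer, Map<Long, Set<String>>> e : byLength.entrySet()) {
            int m = e.getKey();
            if (m > text.length()) continue;
            long high = 1; // BASE^(m-1) mod MOD, weight of the outgoing character
            for (int i = 1; i < m; i++) high = high * BASE % MOD;
            long h = hash(text, m);
            for (int i = 0; ; i++) {
                Set<String> candidates = e.getValue().get(h);
                if (candidates != null) {
                    for (String c : candidates) {
                        if (text.startsWith(c, i)) found.add(c); // verify the hash hit
                    }
                }
                if (i + m >= text.length()) break;
                // Roll the window: drop text[i], append text[i + m].
                h = (h - text.charAt(i) * high % MOD + MOD) % MOD;
                h = (h * BASE + text.charAt(i + m)) % MOD;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        RabinKarpMulti rk = new RabinKarpMulti(Arrays.asList("ell", "he", "ow", "coast"));
        Set<String> found = rk.findAll("hello coast");
        System.out.println(found.containsAll(Arrays.asList("ell", "he", "coast"))); // true
        System.out.println(found.contains("ow"));                                   // false
    }
}
```

For the question as asked, the pattern set would be the union of all strings over all elements, and an element X is present when the returned set contains every string of X.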

When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter, etc. combination that occurs in at least one string to the list of strings in which it occurs.

The algorithm to index a string would look like this (untested):

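// Requires java.util.HashMap, java.util.HashSet, java.util.Map and java.util.Set.
Map<String, Set<String>> indexes = new HashMap<String, Set<String>>();

// Map every non-empty substring of `string` to the set of strings containing it.
for (int pos = 0; pos < string.length(); pos++) {
    for (int end = pos + 1; end <= string.length(); end++) {
         // substring(begin, end) takes an exclusive end index, not a length
         String substring = string.substring(pos, end);
         Set<String> stringsForThisKey = indexes.get(substring);
         if (stringsForThisKey == null) {
             stringsForThisKey = new HashSet<String>();
             indexes.put(substring, stringsForThisKey);
         }
         stringsForThisKey.add(string);
    }
}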

Indexing each string that way produces a number of keys that is quadratic in the length of the string, but it only needs to be done once for each string.

But the result would be constant-time access to the list of strings in which a specific string occurs.
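The lookup side of such an index could be sketched as follows. `SubstringIndex`, `add` and `stringsContaining` are names invented for this example, and which strings get indexed is left to the caller.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SubstringIndex {
    private final Map<String, Set<String>> indexes = new HashMap<>();

    // Registers every non-empty substring of string as a key pointing back to string.
    void add(String string) {
        for (int pos = 0; pos < string.length(); pos++) {
            for (int end = pos + 1; end <= string.length(); end++) {
                indexes.computeIfAbsent(string.substring(pos, end), k -> new HashSet<>())
                       .add(string);
            }
        }
    }

    // Single hash lookup: all indexed strings that contain key (empty set if none).
    Set<String> stringsContaining(String key) {
        return indexes.getOrDefault(key, Collections.emptySet());
    }

    public static void main(String[] args) {
        SubstringIndex index = new SubstringIndex();
        index.add("hello");
        index.add("help");
        System.out.println(index.stringsContaining("hel").size()); // 2
        System.out.println(index.stringsContaining("lo"));         // [hello]
        System.out.println(index.stringsContaining("xyz"));        // []
    }
}
```

A 100-character string has 5050 non-empty substrings, so the memory cost grows quickly with string length.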

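You may be looking for the Aho-Corasick algorithm, which constructs an automaton (similar to a trie) from a set of strings (the dictionary) and then uses this automaton to match the input string against the dictionary.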
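A compact sketch of that idea, applied to the question, is shown below. It assumes the dictionary is the union of all strings over all elements; the input is scanned once, and an element counts as present when all of its strings were found. The class `AhoCorasick`, its nested `Node` type and the helper `anyPresent` are written for this illustration and do not come from any library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> out = new ArrayList<>(); // patterns ending at this state
        Node fail;                                   // longest proper suffix state
    }

    private final Node root = new Node();

    AhoCorasick(Iterable<String> patterns) {
        // Step 1: insert all patterns into a trie.
        for (String p : patterns) {
            if (p.isEmpty()) continue;
            Node n = root;
            for (char c : p.toCharArray()) n = n.next.computeIfAbsent(c, k -> new Node());
            n.out.add(p);
        }
        // Step 2: breadth-first pass to set failure links and merge outputs.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node n = queue.remove();
            for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                Node child = e.getValue();
                Node f = n.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.out.addAll(child.fail.out); // inherit matches ending at the suffix
                queue.add(child);
            }
        }
    }

    // Single pass over text; returns every dictionary string that occurs in it.
    Set<String> findAll(String text) {
        Set<String> found = new HashSet<>();
        Node n = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (n != root && !n.next.containsKey(c)) n = n.fail;
            Node nx = n.next.get(c);
            if (nx != null) n = nx;
            found.addAll(n.out);
        }
        return found;
    }

    // An element X is present in p iff all of its strings were found in p.
    static boolean anyPresent(List<Set<String>> elements, AhoCorasick ac, String p) {
        Set<String> found = ac.findAll(p);
        for (Set<String> x : elements) {
            if (found.containsAll(x)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> elements = Arrays.asList(
                new HashSet<String>(Arrays.asList("sea", "coast")),
                new HashSet<String>(Arrays.asList("hill", "top")));
        Set<String> dictionary = new HashSet<>();
        for (Set<String> x : elements) dictionary.addAll(x);
        AhoCorasick ac = new AhoCorasick(dictionary); // built once, reused per input

        System.out.println(anyPresent(elements, ac, "a seaside coastline")); // true
        System.out.println(anyPresent(elements, ac, "hilly coast"));         // false
    }
}
```

The automaton is built once during pre-processing, and each input is then scanned in a single pass, independent of how many elements there are.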

One way is to generate every possible substring and add it to a set. This is pretty inefficient.

Instead you can add every suffix of each string (from each position to the end) to a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.

// Requires java.util.HashSet, java.util.NavigableSet, java.util.Set and java.util.TreeSet.
static class SubstringMatcher {
    final NavigableSet<String> set = new TreeSet<String>();

    SubstringMatcher(Set<String> strings) {
        for (String string : strings) {
            for (int i = 0; i < string.length(); i++)
                set.add(string.substring(i));
        }
        // Remove redundant entries: a suffix that is a prefix of the next entry adds nothing.
        String last = "";
        for (String string : set.toArray(new String[set.size()])) {
            if (string.startsWith(last))
                set.remove(last);
            last = string;
        }
    }

    public boolean findIn(String s) {
        String s1 = set.ceiling(s);
        return s1 != null && s1.startsWith(s);
    }
}

public static void main(String... args) {
    Set<String> strings = new HashSet<String>();
    strings.add("hello");
    strings.add("there");
    strings.add("old");
    strings.add("world");
    SubstringMatcher sm = new SubstringMatcher(strings);
    System.out.println(sm.set);
    for (String s : "ell,he,ow,lol".split(","))
        System.out.println(s + ": " + sm.findIn(s));
}

prints

[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false

You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here

I have used proprietary implementations (that I no longer even have access to) and they are very fast.

