简体   繁体   中英

Efficient way to search for a set of strings in a string in Java

I have a set of elements of size about 100-200. Let a sample element be X .

Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = { s1 , s2 , s3 }

For a given input string (about 100 characters), say P , I want to test whether any of the X is present in the string.

X is present in P iff for all s belong to X , s is a substring of P .

The set of elements is available for pre-processing.


I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:

  • Checking whether all the strings s are substring of P seems like a costly operation
  • Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
  • I cannot directly use regex as s1 , s2 , s3 can be present in any order and all of the strings need to be present as substring

Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.

Please note that the set of elements is available for pre-processing and I want the solution in java.

You can use regex directly:

Pattern regex = Pattern.compile(
    "^               # Anchor search to start of string\n" +
    "(?=.*s1)        # Check if string contains s1\n" +
    "(?=.*s2)        # Check if string contains s2\n" +
    "(?=.*s3)        # Check if string contains s3", 
    Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();

foundMatch is true if all three substrings are present in the string.

Note that you might need to escape your "needle strings" if they could contain regex metacharacters.

It sounds like you're prematurely optimising your code before you've actually discovered a particular approach is actually too slow.

The nice property about your set of strings is that the string must contain all elements of X as a substring -- meaning we can fail fast if we find one element of X that is not contained within P . This might turn out a better time saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in 100 length string when checking for the presence of a 5 length string with non-repeating characters (eg. coast). And since X has 100-200 elements you really, really want to fail fast if you can.

My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.

Looks like a perfect case for the Rabin–Karp algorithm :

Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.

When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.

The algorithm to index a string would look like that (untested):

HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();

for (int pos = 0; pos < string.length(); pos++) {
    for (int sublen=0; sublen < string.length-pos; sublen++) {
         String substring = string.substr(pos, sublen);
         Set<String> stringsForThisKey = indexes.get(substring);
         if (stringsForThisKey == null) {
             stringsForThisKey = new HashSet<String>();
             indexes.put(substring, stringsForThisKey);
         }
         stringsForThisKey.add(string);
    }
}

Indexing each string that way would be quadratic to the length of the string, but it only needs to be done once for each string.

But the result would be constant-speed access to the list of strings in which a specific string occurs.

您可能正在寻找Aho-Corasick算法 ,该算法从字符串集(字典)构造自动机(类似于trie),并尝试使用此自动机将输入字符串与字典进行匹配。

One way is to generate every possible substring and add this to a set. This is pretty inefficient.

Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.

static class SubstringMatcher {
    final NavigableSet<String> set = new TreeSet<String>();

    SubstringMatcher(Set<String> strings) {
        for (String string : strings) {
            for (int i = 0; i < string.length(); i++)
                set.add(string.substring(i));
        }
        // remove duplicates.
        String last = "";
        for (String string : set.toArray(new String[set.size()])) {
            if (string.startsWith(last))
                set.remove(last);
            last = string;
        }
    }

    public boolean findIn(String s) {
        String s1 = set.ceiling(s);
        return s1 != null && s1.startsWith(s);
    }
}

public static void main(String... args) {
    Set<String> strings = new HashSet<String>();
    strings.add("hello");
    strings.add("there");
    strings.add("old");
    strings.add("world");
    SubstringMatcher sm = new SubstringMatcher(strings);
    System.out.println(sm.set);
    for (String s : "ell,he,ow,lol".split(","))
        System.out.println(s + ": " + sm.findIn(s));
}

prints

[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false

You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here

I have used proprietary implementations (that I no longer even have access to) and they are very fast.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM