简体   繁体   中英

Given a string s and an array of smaller strings, T, how to design a method to search s for each small string in T?

Given a string s and an array of smaller strings, T, design a method to search s for each small string in T.

Thanks.

Assuming you have a significant number of smaller strings, Rabin-Karp is the standard way to search for multiple small strings in a very very large string. if You only have a few smaller strings, simply repeating Boyer-Moore for each one might be a better alternative.

The fastest way that I know of that solves this problem is the Aho-Corasick algorithm . For large strings and large numbers of patterns to be searched, it is faster than applying a linear time search (eg KMP, Rabin-Karp, Boyer-Moore) for each pattern.

But are you sure you need something like this and that your strings are too long for the straightforward method of string matching?

This sounds like a simple for loop:

for(string t : T)
{
    if (t.equals(s)) {
        /* do stuff with t */
    }
}

From How to use For Each loop

You can't pick a "best" algorithm without knowing more details about the data set.

  • Are these statistically random strings?
  • Is there a lot or a little repetition in the small strings?
  • Do you want to optimize for speed of execution or low memory consumption?
  • Will you perform this search multiple times with the same sub strings (T) or the same main string (s)?

Without this information, the "best" solution is the simplest one.

static IEnumerable<string> FindIn(this IEnumerable<string> T, string s) {
    return T.Where(t => s.Contains(t));
}

Could you please clarify a bit?

**The algorithm would STRONGLY depend on what you mean by "Search for". **

  • Do you wish to find whether every string in T is a proper sub-string of S? Or Any string?

  • Do you need Yes/No answer or indexes?

  • Do you care if the answers overlap (eg "ABCDE" contains both "ABC" and "CDE" but ONLY if you don't care about overlapping).

A simplistic method (assuming search strings all start out fairly differently) is to:

  • Have a map of "first character" => map_of_first_2_characters__to__list_of_strings.

  • Loop over each position in S, find the character as a key in above map.

    • The value will be another map, mapping 2-character strings to a list of substrings starting with these 2 characters.

    • Find the character and its right neighbour in the submap the value will be a list of strings starting with these 2 values.

    • Assuming fairly equal distribution of starting characters in T and T not being too big (if it's too big, merely build the data structure one level more, by mapping 3 characters) - we just found a very short list of plausible matches starting with the current position. String-compare them all. Mark off the ones (if any) that are substrings of S starting with current position. If the goal is not to find ALL matches of ALL strings, eliminate the ones you found as matches from the data structure.

You may wish to read this for advanced stuff

Let's make it into a Java solution

boolean isSubset(String[] t, String s) {
    for (String sample: t)
        if (!sample.equals(s))
            return false;
    return true;
}

You can make it faster using Falaina's recommendations, but do you really need it?

If you have room for a table of pointers (Pointer Size * NumCharsInSource), you can sort each string in the source (the string starting at character) using something like QSort. You can then BSearch the smaller strings into the pointer table. Assuming N characters and M substrings, the sort will have O(N lg N) performance and the lookups will have O(M lg N) performance. The overall performance should be O((N+M) lg N).

However, there can be degenerate cases in which the strings in the source are highly repetitive (ie 100,000 a's followed by ab). That will make the compare for the sort portion very slow :-( to get around this, you can special case long runs of characters but that gets much more complicated.

The algorithm to choose really depends on your source data and how much spare memory you have to work with.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM