Is there an efficient way to detect if a string contains a substring which is in a large set of characteristic strings?

For example, given the string aaaaaaaaaXyz, I want to find out whether it contains a substring that belongs to a set of characteristic strings {'xy', 'xyz', 'zzz', 'cccc', 'dddd', ...}, which may have one million members. Is there an efficient way to do this?

Given that your search set might be very large, I would recommend just iterating that set and checking for a potential substring match:

public boolean containsSubstring(String input, Set<String> subs) {
    boolean match = false;

    for (String sub : subs) {
        if (input.contains(sub)) {
            match = true;
            break;
        }
    }

    return match;
}
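
For what it's worth, the same scan can be written as a stream pipeline. This is only a sketch of the loop above, not a faster algorithm: the class name SubstringScan is made up for illustration, and the cost is still one pass over the input per set member in the worst case.

```java
import java.util.Set;

// Hypothetical helper class; behaves like the explicit loop above.
final class SubstringScan {
    static boolean containsSubstring(String input, Set<String> subs) {
        // anyMatch short-circuits on the first member that occurs in input,
        // which plays the role of the break in the loop version.
        return subs.stream().anyMatch(input::contains);
    }

    public static void main(String[] args) {
        Set<String> subs = Set.of("xy", "xyz", "zzz", "cccc", "dddd");
        System.out.println(containsSubstring("aaaaaaaaaxyz", subs)); // true
        System.out.println(containsSubstring("aaaaaaaaa", subs));    // false
    }
}
```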

First, build a dictionary that groups the characteristic strings by their first character:

Set<String> stringSet = Set.of("xy", "xyz", "zzz", "zzy", "cccc", "dddd");
Map<Character, List<String>> dictionary = new HashMap<>();
for (String word : stringSet)
    dictionary.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
System.out.println(dictionary);

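Output (Set.of and HashMap do not guarantee iteration order, so the order of keys and list entries may differ):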

{c=[cccc], d=[dddd], x=[xyz, xy], z=[zzy, zzz]}

Then use the following method to check whether the input contains any of them:

static boolean contains(String input, Map<Character, List<String>> dictionary) {
    for (int i = 0, max = input.length(); i < max; ++i) {
        char first = input.charAt(i);
        if (dictionary.containsKey(first))
            for (String word : dictionary.get(first))
                if (input.startsWith(word, i))
                    return true;
    }
    return false;
}
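
To try the two pieces together, here is a small end-to-end sketch. The class name FirstCharIndexDemo and the buildDictionary helper are names I made up for illustration; the lookup follows the method above, with getOrDefault replacing the containsKey/get pair and an added guard for empty strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper class; the lookup logic mirrors the contains method above.
final class FirstCharIndexDemo {
    // Group the characteristic strings by their first character.
    static Map<Character, List<String>> buildDictionary(Set<String> words) {
        Map<Character, List<String>> dictionary = new HashMap<>();
        for (String word : words) {
            if (word.isEmpty()) continue; // charAt(0) would throw on an empty string
            dictionary.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
        }
        return dictionary;
    }

    // At each position, only try the words that start with the character found there.
    static boolean contains(String input, Map<Character, List<String>> dictionary) {
        for (int i = 0; i < input.length(); i++) {
            for (String word : dictionary.getOrDefault(input.charAt(i), List.of())) {
                if (input.startsWith(word, i)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Character, List<String>> dictionary =
                buildDictionary(Set.of("xy", "xyz", "zzz", "zzy", "cccc", "dddd"));
        System.out.println(contains("aaaaaaaaaxyz", dictionary)); // true
        System.out.println(contains("aaaaaaaaaXyz", dictionary)); // false: matching is case-sensitive
    }
}
```

Note that the comparison is case-sensitive, so the literal input from the question (with a capital X) does not match xy unless both sides are normalized first.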

Thanks to a hint from Clashsoft, I found a Java implementation of the Aho-Corasick algorithm, which is exactly what I was looking for.
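
For readers who want to see the idea, below is a minimal sketch of an Aho-Corasick automaton that only answers "does any pattern occur?". It is not the implementation referred to above; the class name AhoCorasickSketch and its containsAny method are made up for illustration. After the automaton is built once, each query is a single left-to-right pass over the input, regardless of how many patterns there are.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical class; a minimal Aho-Corasick automaton for yes/no queries.
final class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // node for the longest proper suffix that is also a trie path
        boolean terminal; // a pattern ends here, or at a node reachable via fail links
    }

    private final Node root = new Node();

    AhoCorasickSketch(Collection<String> patterns) {
        // Step 1: insert every pattern into a trie.
        for (String p : patterns) {
            if (p.isEmpty()) continue; // assumption: empty patterns are ignored
            Node cur = root;
            for (int i = 0; i < p.length(); i++) {
                cur = cur.next.computeIfAbsent(p.charAt(i), k -> new Node());
            }
            cur.terminal = true;
        }
        // Step 2: compute failure links breadth-first, shallow nodes before deep ones.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // A pattern ending at the suffix node also ends at this position.
                child.terminal |= child.fail.terminal;
                queue.add(child);
            }
        }
    }

    // Returns true if any pattern occurs somewhere in text (case-sensitive).
    boolean containsAny(String text) {
        Node cur = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            Node nxt = cur.next.get(c);
            cur = (nxt != null) ? nxt : root;
            if (cur.terminal) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac =
                new AhoCorasickSketch(List.of("xy", "xyz", "zzz", "cccc", "dddd"));
        System.out.println(ac.containsAny("aaaaaaaaaxyz")); // true
        System.out.println(ac.containsAny("aaaaaaaaaXyz")); // false: capital X does not match
    }
}
```

For a set of a million strings, a HashMap per node is memory-hungry; a production library will typically use a more compact transition table, so treat this as a way to understand the algorithm rather than something to deploy as-is.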
