
Algorithm to match sequential subset from a list

I am trying to remember the right algorithm for finding, within an input sequence, an in-order (not necessarily contiguous) selection of elements that matches one entry from a list of candidate subsets. For example, given the input:

aehfaqptpzzy

and the subset list:

{ happy, sad, indifferent }

we can see that the word "happy" is a match because its letters appear, in order, inside the input:

ae h f a q p t p zz y

I am pretty sure there is a specific algorithm to find all such matches, but I cannot remember what it is called.

UPDATE

The example above is not very good because it contains repeated letters. In my actual problem, both the dictionary entries and the input string are sortable sets. For example,

input: acegimnrqvy

dictionary: { cgn, dfr, lmr, mnqv, eg }

So in this example the algorithm would return cgn, mnqv and eg as matches. I would also like to find the best set of complementary matches, where "best" means longest. In the example above, the "best" answer would be "cgn mnqv"; eg would not be included because it conflicts with cgn, which is a longer match.

I realize that the problem can be solved by a brute-force scan, but that is undesirable because there could be thousands of entries in the dictionary and thousands of values in the input string. If we are also trying to find the best set of matches, running time will become an issue.
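For reference, the brute-force scan mentioned above can be sketched in Python roughly as follows. This is only a minimal baseline, and the names `is_subsequence` and `naive_matches` are made up for this illustration. Each dictionary word is checked against the input independently, so the cost grows with (number of words) × (input length).

```python
def is_subsequence(word, text):
    """Return True if the letters of `word` appear in `text` in order."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until it finds `ch`,
    # so each letter of `text` is consumed at most once.
    return all(ch in remaining for ch in word)


def naive_matches(text, dictionary):
    """Check every dictionary word against the input, one at a time."""
    return [word for word in dictionary if is_subsequence(word, text)]


if __name__ == "__main__":
    print(naive_matches("aehfaqptpzzy", ["happy", "sad", "indifferent"]))
    print(naive_matches("acegimnrqvy", ["cgn", "dfr", "lmr", "mnqv", "eg"]))
```

This finds the individual matches only; it does not attempt to pick the "best" complementary set.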

You can use the Aho-Corasick algorithm with more than one current state. Think of each current state as an "actor" in the trie. For each input letter, every actor either stays where it is (skipping the letter) or moves along the matching edge. If two or more actors meet at the same node, just merge them into one (if you are interested only in whether a word is present, not in how many ways it matches).

As for the complexity: this can be as slow as the naive O(MN) approach, because there can be as many actors as there are entries in the dictionary. In practice, however, we can make good use of the fact that many words share prefixes with others: there will never be more actors than there are nodes in the trie, which tends to be much smaller than the total size of the dictionary.
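A minimal Python sketch of this idea, under some assumptions: it builds only the trie (the goto part of the automaton) and does not use failure links, since in this sketch an actor can always skip a letter by staying put. The set of active nodes plays the role of the merged actors. The function names `build_trie` and `subsequence_matches` are hypothetical, chosen for this example.

```python
def build_trie(words):
    """Build a trie as two parallel lists indexed by node number."""
    children = [{}]      # children[node] maps a letter to a child node
    terminal = [None]    # terminal[node] is the word ending there, if any
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                children.append({})
                terminal.append(None)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        terminal[node] = word
    return children, terminal


def subsequence_matches(text, words):
    """Return the words whose letters occur in order inside `text`."""
    children, terminal = build_trie(words)
    active = {0}         # one entry per distinct actor position (root = 0)
    found = set()
    for ch in text:
        # Every actor may move along the edge labelled `ch`; the old
        # positions stay active too, which models skipping the letter.
        moved = {children[n][ch] for n in active if ch in children[n]}
        for n in moved:
            if terminal[n] is not None:
                found.add(terminal[n])
        active |= moved  # using a set merges actors that meet at a node
    return [w for w in words if w in found]


if __name__ == "__main__":
    print(subsequence_matches("acegimnrqvy", ["cgn", "dfr", "lmr", "mnqv", "eg"]))
```

Because `active` can never hold more entries than the trie has nodes, the work per input letter is bounded by the trie size rather than by the number of dictionary words. Choosing the "best" complementary set from the matches is a separate step that this sketch does not cover.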
