
Longest substring in a large set of strings

I have a huge, fixed library of text strings and a frequently changing input string s. I need to find, in minimal time, the longest match between s and any string in the library, where the match must start at the beginning of s. Ideally, I would also get the next longest match from the library, then the next best, and so on. This is not the longest common substring problem: I'm not looking for one string common to all the strings in the library. I just need the best pairwise match between s and each string in the vast library, as fast as possible.

After rereading, I think the best way to do this is probably to build a trie (also called a prefix tree) from your big library of strings, then match s against it.

This has a couple of advantages. First, it stores your big library in (at least somewhat) compressed form. Second, it more or less automatically tells you all the strings that match a given input, not just the longest one.

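It also fits your use case quite well. The library is fixed, so although building the trie from it takes quite a bit of work, you only do that once, and using it afterwards is quite fast: a lookup takes time roughly proportional to the length of s, not to the size of the library.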
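To make the trie idea concrete, here is a minimal sketch in C++14. It assumes that "match" means the longest prefix of s that is also a prefix of some library string. The names `Trie`, `TrieNode`, `TrieMatch`, `kNoMatch` and `example_id` are made up for illustration and do not come from the answer above. Each node remembers the index of one library string that passes through it, so the lookup can report which string produced the match.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Sentinel meaning "no library string matched" (illustrative name).
constexpr std::size_t kNoMatch = static_cast<std::size_t>(-1);

// Result of a lookup: how many leading characters of s matched,
// and the index of one library string that shares that prefix.
struct TrieMatch {
    std::size_t length;
    std::size_t id;
};

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::size_t example_id = kNoMatch;  // one library string passing through this node
};

class Trie {
public:
    // Add one library string; `id` is its index in the library.
    void insert(const std::string& word, std::size_t id) {
        TrieNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) {
                child = std::make_unique<TrieNode>();
                child->example_id = id;  // first string to reach this node
            }
            node = child.get();
        }
    }

    // Walk s down the trie for as long as the characters keep matching.
    TrieMatch longest_prefix(const std::string& s) const {
        const TrieNode* node = &root_;
        TrieMatch m{0, kNoMatch};
        for (char c : s) {
            auto it = node->children.find(c);
            if (it == node->children.end()) break;  // no library string continues with c
            node = it->second.get();
            ++m.length;
            m.id = node->example_id;
        }
        return m;
    }

private:
    TrieNode root_;
};
```

With a library of "apple pie", "application" and "banana" inserted as ids 0, 1 and 2, looking up "applesauce" walks five characters ("apple") and reports id 0. The nodes visited along the way are also where the shorter, next-best matches live, which is what makes the structure a natural fit for ranking. A `std::map` per node keeps the sketch short; a fixed array or a compressed (radix) layout would be the usual choice for a very large library.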

Sort your list ahead of time (i.e. at compile time or before), then use bsearch:

http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/

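Once you find your match, you can look behind and ahead in the vicinity to get as many "matches" as you want, because strings that share a prefix sit next to each other in sorted order. Keep in mind that with a plain strcmp-style comparator, bsearch returns NULL unless it finds an exact match, so for partial matches you need a comparator or a search that still lands you in the right neighborhood.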

BTW, bsearch isn't necessarily the fastest option, because it calls the comparator through a function pointer on every comparison, but it is in the standard C library.
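As a rough sketch of the sorted-list approach, the code below uses C++ `std::lower_bound` instead of C `bsearch`. That is a substitution, not what the answer prescribes: `std::lower_bound` returns the insertion point even when there is no exact match, and the comparison can be inlined. The sketch relies on the fact that, in a sorted list, the string sharing the longest prefix with s is always one of the two neighbors of that insertion point. The names `SortedMatch`, `kNone`, `common_prefix_len` and `longest_prefix_sorted` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sentinel meaning "nothing in the library shares a prefix with s" (illustrative name).
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

struct SortedMatch {
    std::size_t length;  // number of leading characters shared with s
    std::size_t index;   // position of the matching string in the sorted library
};

// Length of the common prefix of a and b.
std::size_t common_prefix_len(const std::string& a, const std::string& b) {
    std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// `sorted` must already be sorted with operator< (for example via std::sort).
SortedMatch longest_prefix_sorted(const std::vector<std::string>& sorted,
                                  const std::string& s) {
    SortedMatch best{0, kNone};
    // First element that is not less than s, i.e. where s would be inserted.
    auto it = std::lower_bound(sorted.begin(), sorted.end(), s);

    // Candidate 1: the element at the insertion point.
    if (it != sorted.end()) {
        std::size_t len = common_prefix_len(*it, s);
        if (len > best.length) {
            best = {len, static_cast<std::size_t>(it - sorted.begin())};
        }
    }
    // Candidate 2: the element just before the insertion point.
    if (it != sorted.begin()) {
        auto prev = it - 1;
        std::size_t len = common_prefix_len(*prev, s);
        if (len > best.length) {
            best = {len, static_cast<std::size_t>(prev - sorted.begin())};
        }
    }
    return best;
}
```

To collect the next-best matches, one could keep stepping outward from the returned index in both directions and stop once the shared prefix length drops to zero, which is the "look behind and ahead" step described above.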
