
Longest substring in a large set of strings

I have a huge, fixed library of text strings and a frequently changing input string s. I need to find, in minimal time, the longest substring of any library string that matches s, starting from the beginning of s. In a perfect world, I would also return the next longest match from the library, then the next best, and so on. This is not the longest common substring problem: I'm not looking for a substring common to all the strings in the library. I just need the best pairwise match between s and each string in the vast library, as fast as possible.

After rereading, I think the best way to do this is probably to build a trie or a prefix tree of your big library of strings, then match s against that.

This has a couple of advantages. First, it stores your big library in (at least somewhat) compressed form, since strings that share a prefix share the nodes for that prefix. Second, it more or less automatically tells you all the strings that match a given input, not just the longest one.

It also fits your use case quite well: while it takes quite a bit of work to build a trie or (especially) a prefix tree from the input, using it afterwards is quite fast.
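As a rough illustration of the trie idea, here is a minimal sketch in C. It assumes byte-oriented, NUL-terminated strings and uses a plain 256-way child array per node, which is simple but memory-hungry for a really large library. The names `TrieNode`, `trie_new`, `trie_insert`, `trie_longest_prefix` and `trie_free` are made up for this example and do not come from any library.

```c
#include <stddef.h>
#include <stdlib.h>

#define TRIE_FANOUT 256  /* one child slot per possible byte value (assumption) */

typedef struct TrieNode {
    struct TrieNode *child[TRIE_FANOUT];
} TrieNode;

/* Allocate an empty node; returns NULL if allocation fails. */
TrieNode *trie_new(void) {
    return calloc(1, sizeof(TrieNode));
}

/* Add one library string to the trie. Returns 0 on success, -1 on allocation failure. */
int trie_insert(TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (const unsigned char *p = (const unsigned char *)word; *p; ++p) {
        if (!node->child[*p]) {
            node->child[*p] = trie_new();
            if (!node->child[*p])
                return -1;
        }
        node = node->child[*p];
    }
    return 0;
}

/* Length of the longest prefix of s that is also a prefix of some inserted string.
   The walk costs O(strlen(s)) regardless of how many strings the library holds. */
size_t trie_longest_prefix(const TrieNode *root, const char *s) {
    const TrieNode *node = root;
    size_t len = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p) {
        const TrieNode *next = node->child[*p];
        if (!next)
            break;  /* no library string continues with this byte */
        node = next;
        ++len;
    }
    return len;
}

/* Release a node and everything below it. */
void trie_free(TrieNode *node) {
    if (!node)
        return;
    for (size_t i = 0; i < TRIE_FANOUT; ++i)
        trie_free(node->child[i]);
    free(node);
}
```

The trie is built once from the fixed library, and each new s is then matched with a single call to `trie_longest_prefix`. This sketch only matches from the start of each library string; matching s against substrings that begin anywhere inside a library string would need every suffix inserted as well (or a suffix tree), which is not shown here.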

Sort your list ahead of time (i.e. at compile time or before), then use bsearch:

http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/

Once you find your match, you can look behind and ahead in the vicinity to get as many "matches" as you want.

BTW, bsearch isn't necessarily the fastest, because it takes the comparator as a function pointer and so makes an indirect call for every comparison, but it is in the standard C library.
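One caveat: bsearch returns NULL when no element compares equal to the key, so it cannot report a position for an s that is not itself in the library. The sketch below therefore swaps in a hand-written lower-bound binary search over the sorted array, then checks the two neighbours of the insertion point, since in strcmp order the string sharing the longest prefix with s sits next to where s would be inserted. It assumes the array is already sorted with strcmp and is non-empty; `lower_bound`, `common_prefix_len` and `best_prefix_match` are hypothetical names for this example.

```c
#include <stddef.h>
#include <string.h>

/* Index of the first element that is not less than key in strcmp order,
   or n if every element is less than key. */
size_t lower_bound(const char *const *sorted, size_t n, const char *key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(sorted[mid], key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Number of leading bytes that a and b have in common. */
size_t common_prefix_len(const char *a, const char *b) {
    size_t i = 0;
    while (a[i] && a[i] == b[i])
        ++i;
    return i;
}

/* Returns the index of the library string sharing the longest prefix with s
   and stores that prefix length in *len. Requires n > 0. */
size_t best_prefix_match(const char *const *sorted, size_t n,
                         const char *s, size_t *len) {
    size_t pos = lower_bound(sorted, n, s);
    size_t best = 0, best_len = 0;
    if (pos < n) {                      /* neighbour at or after s */
        best = pos;
        best_len = common_prefix_len(sorted[pos], s);
    }
    if (pos > 0) {                      /* neighbour just before s */
        size_t l = common_prefix_len(sorted[pos - 1], s);
        if (pos == n || l > best_len) {
            best = pos - 1;
            best_len = l;
        }
    }
    *len = best_len;
    return best;
}
```

The array could be sorted once up front with qsort and a strcmp-based comparator. Next-best candidates can then be found by stepping outward from the returned index in both directions, as described above.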

