
Find the overlap of a set of equal-length strings?

There are 1 million short strings of equal length. For example:

abcdefghi

fghixyzyz

ghiabcabc

zyzdddxfg

. . .

I want to find the pairwise overlap of two strings. The overlap of A = "abcdefghi" and B = "fghixyzyz" is "fghi": the longest suffix of A that is also a prefix of B.
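To make the definition concrete, here is a minimal brute-force sketch for a single pair. The function name `overlap` is only illustrative; it tries candidate lengths from longest to shortest, so it costs up to O(L^2) character comparisons per pair for strings of length L.

```python
def overlap(a: str, b: str) -> str:
    """Return the longest suffix of `a` that is also a prefix of `b`."""
    # Try the longest candidate first so the first hit is the maximal overlap.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return b[:k]
    return ""  # no non-empty overlap


print(overlap("abcdefghi", "fghixyzyz"))  # fghi
```

Note that the overlap is directional: `overlap(a, b)` and `overlap(b, a)` are generally different.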

Is there an efficient algorithm that can find the overlap of any two strings in the set?

One efficient approach is to build a generalized suffix tree for the string set. To find the overlap between strings x and y:

Starting at the root, follow the path labelled by string y in the generalized suffix tree. The deepest node along this path that has an edge to the terminal symbol of string x (meaning a suffix of x ends at that node) has a path label equal to the suffix-prefix overlap x->y.

For more details see page 137 ("Solving the all-pairs suffix-prefix problem in linear time") of Gusfield's book "Algorithms on Strings, Trees, and Sequences".

Caution: this uses a LOT of memory if your dataset is large (millions or billions of strings).
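A full generalized suffix tree is too long to show here, so the sketch below swaps in a simpler technique that suits short strings: a hash index from every prefix to the strings that start with it. This is not Gusfield's linear-time algorithm. The names `all_overlaps` and `min_len` are made up for illustration, and the index stores every prefix of every string, so it is also memory-hungry; the result itself can hold up to n^2 pairs.

```python
from collections import defaultdict


def all_overlaps(strings, min_len=1):
    """Map (i, j) -> length of the longest suffix of strings[i]
    that is a prefix of strings[j], for pairs with overlap >= min_len."""
    # prefix -> indices of the strings that start with that prefix
    prefix_index = defaultdict(list)
    for j, s in enumerate(strings):
        for k in range(min_len, len(s) + 1):
            prefix_index[s[:k]].append(j)

    result = {}
    for i, s in enumerate(strings):
        # Longest suffixes first, so the first hit for a pair is maximal.
        for k in range(len(s), min_len - 1, -1):
            for j in prefix_index.get(s[len(s) - k:], ()):
                if i != j and (i, j) not in result:
                    result[(i, j)] = k
    return result


strings = ["abcdefghi", "fghixyzyz", "ghiabcabc", "zyzdddxfg"]
overlaps = all_overlaps(strings)
print(overlaps[(0, 1)])  # 4, i.e. "fghi"
```

Raising `min_len` prunes short, usually uninteresting overlaps and shrinks both the index and the output.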
