
Find the overlap of a set of equal-length strings?

Original post: 2012-05-25 03:20:54. Tags: string, algorithm, string-comparison, overlap, stringcollection

There are 1 million short strings of equal length. For example:

abcdefghi

fghixyzyz

ghiabcabc

zyzdddxfg

...

I want to find the pairwise overlap of two strings. The overlap of A = "abcdefghi" and B = "fghixyzyz" is "fghi": the longest suffix of A that is also a prefix of B.
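To make the definition concrete, here is a minimal brute-force sketch for a single pair. The function name `overlap` is an assumption chosen for illustration. It tries candidate lengths from longest to shortest, which is fine for short strings but is not the efficient all-pairs method the question asks for.

```python
def overlap(a: str, b: str) -> str:
    """Return the longest suffix of `a` that is also a prefix of `b`."""
    # Try the longest candidate first so the first hit is the maximal overlap.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return b[:k]
    return ""  # no non-empty overlap


print(overlap("abcdefghi", "fghixyzyz"))  # fghi
```

Note that the overlap is directional: `overlap(a, b)` and `overlap(b, a)` are generally different.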

Is there an efficient algorithm that can find the overlap of any two strings in the set?

1 Answer

One efficient approach is to build a generalized suffix tree for the string set. To find the overlap between strings x and y:

Follow the path labeled by string y from the root of the generalized suffix tree. The deepest node along this path that has an edge labeled with the terminal symbol of string x (meaning a suffix of x ends at that node) has a path label equal to the suffix-prefix overlap x->y.
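The lookup above can be sketched with a simplified stand-in: an uncompressed suffix trie rather than a true generalized suffix tree. This is not Gusfield's linear-time construction. It takes space proportional to the total length of all suffixes, which is only tolerable because the strings are short. The names `TrieNode`, `build_suffix_trie`, and `all_overlaps` are assumptions made for illustration. Each node records which strings have a suffix ending there, which plays the role of the terminal-symbol edges.

```python
class TrieNode:
    __slots__ = ("children", "ends")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.ends = []      # indices of strings with a suffix ending at this node


def build_suffix_trie(strings):
    """Insert every suffix of every string into one shared trie."""
    root = TrieNode()
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, TrieNode())
            node.ends.append(idx)  # suffix s[start:] of string idx ends here
    return root


def all_overlaps(strings):
    """Map (i, j) -> length of the longest suffix of strings[i] that is a
    prefix of strings[j]. Pairs with no overlap are omitted."""
    root = build_suffix_trie(strings)
    best = {}
    for j, y in enumerate(strings):
        node = root
        # Walk down the path spelled by y; every prefix of y is visited once.
        for depth, ch in enumerate(y, start=1):
            node = node.children.get(ch)
            if node is None:
                break
            for i in node.ends:
                if i != j:
                    best[(i, j)] = depth  # deeper nodes overwrite shallower ones
    return best


strings = ["abcdefghi", "fghixyzyz", "ghiabcabc", "zyzdddxfg"]
overlaps = all_overlaps(strings)
print(overlaps[(0, 1)])  # 4, i.e. "fghi"
```

Because the walk along y visits nodes in order of increasing depth, the last value written for a pair corresponds to the deepest node, matching the "deepest node" rule described above.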

For more details, see page 137 ("Solving the all-pairs suffix-prefix problem in linear time") of Gusfield's book "Algorithms on Strings, Trees, and Sequences".

Caution: this uses a LOT of memory if your dataset is large (millions or billions of strings).