简体   繁体   中英

Find the overlap of a set of equal length string?

There are 1 million of equal length strings(short string).For example

abcdefghi

fghixyzyz

ghiabcabc

zyzdddxfg

. . .

I want to find pair-wise overlap of two string.The overlap of A"abcdefghi" and B"fghixyzyz" is "fghi",which is the maximal suffix of A , the maximal prefix of B ,satisfy the suffix and the prefix are equal.

Is there efficient algorithm which can find the overlap of any two strings in the set?

One of the efficient ways is to build a general suffix tree for the string set. To find the overlap between string x and y:

Follow the path label for string y in the general suffix tree. The deepest node along this path that is incident to the terminal symbol of string x has a path label which is equivalent to the suffix-prefix overlap x->y.

For more details see page 137 ("Solving the all-pairs suffix-prefix problem in linear time") of Gusfield's book "Algorithms on Strings, Trees, And Sequences".

Caution: this uses a LOT of memory if your dataset is large (millions/billions of strings).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM