Longest Common Subsequence for Multiple Sequences

I have done a lot of research on finding the longest common subsequence for M = 2 sequences, but I am trying to figure out how to do it for M ≥ 2 sequences.

I am given N and M: M sequences of N unique elements. The elements come from the set {1, ..., N}. I have thought about a dynamic programming approach, but I am still unsure how to actually apply it.

Example input

5 3
5 3 4 1 2
2 5 4 3 1
5 2 3 1 4

The longest common subsequence here can be seen to be

5 3 1

Expected output

Length = 3

A simple idea.

For each number i between 1 and N, calculate the length of the longest common subsequence whose last number is i. (Let's call it a[i].)

To do that, we iterate over the numbers i in the first sequence from start to end. If a[i] > 1, then there is a number j that comes before i in every sequence. So we can check all possible values of j and, whenever that condition holds, do a[i] = max(a[i], a[j] + 1).

One last point: because j comes before i in the first sequence, a[j] has already been calculated by the time we reach i. The answer is the largest a[i].

for each i in first_sequence
    // for the OP's example, 'i' would take values [5, 3, 4, 1, 2], in this order
    a[i] = 1;
    for each j in 1..N
        if j is before i in each sequence
            a[i] = max(a[i], a[j] + 1)
        end
    end
end

This is O(N^2 * M) if you precompute a matrix of positions (the index of every number in every sequence) beforehand.
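The pseudocode above could be turned into runnable Python roughly as follows. This is only a sketch: the function name `lcs_length_permutations` and the `pos` table are my own choices, and it assumes every sequence contains each of the numbers 1..N exactly once, as in the example input.

```python
def lcs_length_permutations(seqs):
    """Length of the longest common subsequence of sequences over 1..N.

    Assumes each sequence holds every number 1..N exactly once.
    """
    n = len(seqs[0])

    # pos[s][v] = index of value v in sequence s (the "matrix of positions").
    pos = []
    for s in seqs:
        p = [0] * (n + 1)
        for idx, v in enumerate(s):
            p[v] = idx
        pos.append(p)

    # a[v] = length of the longest common subsequence ending in v.
    a = [0] * (n + 1)
    first = seqs[0]
    for k, i in enumerate(first):
        a[i] = 1
        # Only numbers before i in the first sequence can precede i,
        # and their a[j] values are already final.
        for j in first[:k]:
            if all(p[j] < p[i] for p in pos):
                a[i] = max(a[i], a[j] + 1)
    return max(a)


if __name__ == "__main__":
    example = [
        [5, 3, 4, 1, 2],
        [2, 5, 4, 3, 1],
        [5, 2, 3, 1, 4],
    ]
    print("Length =", lcs_length_permutations(example))  # Length = 3
```

Restricting j to the numbers already seen in the first sequence is equivalent to scanning 1..N, since any j that comes before i in every sequence also comes before i in the first one.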

Since you have unique elements, @Nikita Rybak's answer is the one to go with, but since you mentioned dynamic programming, here's how you'd use DP when you have more than two sequences (shown for three sequences a, b and c):

dp[i, j, k] = length of longest common subsequence considering the prefixes
              a[1..i], b[1..j], c[1..k].


dp[i, j, k] = 0 if i = 0 or j = 0 or k = 0
            = 1 + dp[i - 1, j - 1, k - 1] if a[i] = b[j] = c[k]
            = max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise

To get the actual subsequence back, use a recursive function that starts from dp[a.Length, b.Length, c.Length] and basically reverses the above formulas: if the three current elements are equal, print the element and backtrack to dp[i - 1, j - 1, k - 1]. If not, backtrack to whichever of the three neighbouring states holds the max of the above values.
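As an illustration, here is a minimal Python sketch of the three-sequence recurrence together with the backtracking step. The name `lcs3` is hypothetical, the code uses 0-based indexing (so `a[i - 1]` plays the role of a[i] in the formulas), and the backtracking is written as a loop rather than a recursive function. It needs O(|a| * |b| * |c|) time and memory, so it is only practical for short inputs.

```python
def lcs3(a, b, c):
    """Return one longest common subsequence of a, b and c as a list."""
    la, lb, lc = len(a), len(b), len(c)

    # dp[i][j][k] = LCS length of the prefixes a[:i], b[:j], c[:k];
    # entries with any index equal to 0 stay 0 (the base case).
    dp = [[[0] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            for k in range(1, lc + 1):
                if a[i - 1] == b[j - 1] == c[k - 1]:
                    dp[i][j][k] = dp[i - 1][j - 1][k - 1] + 1
                else:
                    dp[i][j][k] = max(
                        dp[i - 1][j][k], dp[i][j - 1][k], dp[i][j][k - 1]
                    )

    # Walk back from the full prefixes, reversing the recurrence.
    out = []
    i, j, k = la, lb, lc
    while i > 0 and j > 0 and k > 0:
        if a[i - 1] == b[j - 1] == c[k - 1]:
            out.append(a[i - 1])
            i, j, k = i - 1, j - 1, k - 1
        elif dp[i - 1][j][k] == dp[i][j][k]:
            i -= 1
        elif dp[i][j - 1][k] == dp[i][j][k]:
            j -= 1
        else:
            k -= 1
    out.reverse()
    return out


if __name__ == "__main__":
    print(lcs3([5, 3, 4, 1, 2], [2, 5, 4, 3, 1], [5, 2, 3, 1, 4]))  # [5, 3, 1]
```

For M sequences the same idea needs an M-dimensional table, which is why the position-based approach is the better fit when the elements are unique.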

You can look into the paper "Design of a new Deterministic Algorithm for finding Common DNA Subsequence". You can use its algorithm to construct the DAG (page 8, figure 5). From the DAG, read off the largest common distinct subsequences. Then try a dynamic programming approach on that, using the value of M to decide how many DAGs you need to construct per sequence. Basically, use these subsequences as keys, store the numbers of the sequences in which each one is found, and then try to find the largest subsequence (there may be more than one).
