
Edit distance recursive algorithm (Skiena)

I'm reading The Algorithm Design Manual by Steven Skiena, and I'm on the dynamic programming chapter. He has some example code for edit distance that uses a couple of functions which are explained neither in the book nor on the internet. So I'm wondering:

a) How does this algorithm work?

b) What do the functions indel and match do?

#define MATCH     0       /* enumerated type symbol for match */
#define INSERT    1       /* enumerated type symbol for insert */
#define DELETE    2       /* enumerated type symbol for delete */

int string_compare(char *s, char *t, int i, int j)
{
        int k;                  /* counter */
        int opt[3];             /* cost of the three options */
        int lowest_cost;        /* lowest cost */

        if (i == 0) return(j * indel(' '));
        if (j == 0) return(i * indel(' '));

        opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
        opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
        opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);

        lowest_cost = opt[MATCH];
        for (k=INSERT; k<=DELETE; k++)
                if (opt[k] < lowest_cost) lowest_cost = opt[k];

        return( lowest_cost );
}

On page 287 in the book:

int match(char c, char d)
{
  if (c == d) return(0); 
  else return(1); 
}

int indel(char c)
{
  return(1);
}

They're explained in the book. Please read section 8.2.4, Varieties of Edit Distance.
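One detail that is easy to miss when trying the routines above: the recursion reads s[i] and t[j] for i, j >= 1 and never looks at index 0, so the strings have to be stored 1-indexed. The following is a minimal sketch, assuming a leading blank is used as padding. The wrapper name `edit_distance_padded` and the `MAXLEN` bound are hypothetical and only there for illustration.

```c
#include <string.h>

#define MATCH  0
#define INSERT 1
#define DELETE 2
#define MAXLEN 64   /* assumed bound on the padded buffers, illustration only */

static int match(char c, char d) { return (c == d) ? 0 : 1; }
static int indel(char c) { (void)c; return 1; }

/* Same recursion as in the question: i and j are 1-based positions. */
static int string_compare(const char *s, const char *t, int i, int j)
{
    int opt[3], k, lowest_cost;

    if (i == 0) return j * indel(' ');
    if (j == 0) return i * indel(' ');

    opt[MATCH]  = string_compare(s, t, i - 1, j - 1) + match(s[i], t[j]);
    opt[INSERT] = string_compare(s, t, i, j - 1) + indel(t[j]);
    opt[DELETE] = string_compare(s, t, i - 1, j) + indel(s[i]);

    lowest_cost = opt[MATCH];
    for (k = INSERT; k <= DELETE; k++)
        if (opt[k] < lowest_cost) lowest_cost = opt[k];
    return lowest_cost;
}

/* Hypothetical wrapper: copies a and b behind a one-character pad so that
   index 1 holds the first real character. Returns -1 if an input is too long. */
int edit_distance_padded(const char *a, const char *b)
{
    char s[MAXLEN], t[MAXLEN];
    size_t la = strlen(a), lb = strlen(b);

    if (la + 2 > MAXLEN || lb + 2 > MAXLEN) return -1;
    s[0] = ' '; memcpy(s + 1, a, la + 1);   /* la + 1 copies the terminator too */
    t[0] = ' '; memcpy(t + 1, b, lb + 1);
    return string_compare(s, t, (int)la, (int)lb);
}
```

Calling `edit_distance_padded("SUNDAY", "SATURDAY")` gives 3. Because nothing is cached, the running time grows exponentially with the string lengths, so this is only practical for short inputs.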

Basically, it uses dynamic programming, where the solution to a problem is constructed from solutions to its subproblems, computed either bottom-up or top-down so that nothing is recomputed.
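The top-down flavour can be sketched by caching each (i, j) result the first time it is computed. This is an illustration rather than the book's code: the names `edit_distance_memo`, `ed`, `memo` and the `MAXLEN` bound are assumptions, and the strings here are ordinary 0-indexed C strings, with i and j meaning prefix lengths.

```c
#include <string.h>

#define MAXLEN 64                         /* assumed bound on string length */

static int memo[MAXLEN + 1][MAXLEN + 1];  /* -1 means "not computed yet" */

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Distance between the first i characters of s and the first j of t. */
static int ed(const char *s, const char *t, int i, int j)
{
    if (i == 0) return j;                 /* insert the remaining j characters */
    if (j == 0) return i;                 /* delete the remaining i characters */
    if (memo[i][j] >= 0) return memo[i][j];

    int sub = ed(s, t, i - 1, j - 1) + (s[i - 1] != t[j - 1]);
    int ins = ed(s, t, i, j - 1) + 1;
    int del = ed(s, t, i - 1, j) + 1;

    memo[i][j] = min3(sub, ins, del);     /* store before returning */
    return memo[i][j];
}

/* Returns -1 if either string exceeds MAXLEN. */
int edit_distance_memo(const char *s, const char *t)
{
    size_t ls = strlen(s), lt = strlen(t);

    if (ls > MAXLEN || lt > MAXLEN) return -1;
    memset(memo, -1, sizeof memo);        /* all bytes 0xFF, i.e. every int is -1 */
    return ed(s, t, (int)ls, (int)lt);
}
```

Each (i, j) pair is now solved at most once, so the work is proportional to the product of the two lengths instead of growing exponentially.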

The recursive structure of the problem is as given here, where i and j are start (or end) indices in the two strings respectively.


Here's an excerpt from this page that explains the algorithm well.

Problem: Given two strings of sizes m and n and a set of operations replace (R), insert (I) and delete (D), all at equal cost, find the minimum number of edits (operations) required to convert one string into the other.

Identifying Recursive Methods:

What will the sub-problem be in this case? Consider finding the edit distance of part of the strings, say a small prefix. Let us denote the prefixes as [1...i] and [1...j] for some 1 < i < m and 1 < j < n. Clearly this is solving a smaller instance of the final problem; denote it as E(i, j). Our goal is to find E(m, n) while minimizing the cost.

In the prefix, we can right-align the strings in three ways: (i, -), (-, j) and (i, j). The hyphen symbol (-) represents no character. An example makes this clearer.

Given the strings SUNDAY and SATURDAY, we want to convert SUNDAY into SATURDAY with the minimum number of edits. Let us pick i = 2 and j = 4, i.e. the prefix strings are SU and SATU respectively (assume the string indices start at 1). The rightmost characters can be aligned in three different ways.

Case 1: Align characters U and U. They are equal, so no edit is required. We are still left with the problem of i = 1 and j = 3, E(i-1, j-1).

Case 2: Align the rightmost character of the first string with no character from the second string. We need a deletion (D) here. We are still left with the problem of i = 1 and j = 4, E(i-1, j).

Case 3: Align the rightmost character of the second string with no character from the first string. We need an insertion (I) here. We are still left with the problem of i = 2 and j = 3, E(i, j-1).

Combining all the subproblems, the minimum cost of aligning the prefix strings ending at i and j is given by

E(i, j) = min( E(i-1, j) + D, E(i, j-1) + I, E(i-1, j-1) + R )

where R is charged only if the characters at positions i and j differ; if they are the same, the last term is just E(i-1, j-1).

We are not done yet. What will the base case(s) be?

When both strings are of size 0, the cost is 0. When only one of the strings is empty, we need as many edit operations as the length of the non-empty string. Mathematically,

E(0, 0) = 0, E(i, 0) = i, E(0, j) = j
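To check the three cases of the worked example against the recurrence and base cases, the whole table E can be filled and the relevant cells inspected. This is a rough sketch with unit costs D = I = R = 1; the function name `fill_table`, the global array `E` and the `MAXN` bound are assumptions made for the illustration.

```c
#include <string.h>

#define MAXN 32                      /* assumed bound on string length */

/* E[i][j] = cost of converting the first i characters of a
   into the first j characters of b. */
static int E[MAXN + 1][MAXN + 1];

/* Fills E for all prefixes and returns E[m][n], or -1 if an input is too long. */
int fill_table(const char *a, const char *b)
{
    int m = (int)strlen(a), n = (int)strlen(b);

    if (m > MAXN || n > MAXN) return -1;

    for (int i = 0; i <= m; i++) E[i][0] = i;   /* E(i, 0) = i */
    for (int j = 0; j <= n; j++) E[0][j] = j;   /* E(0, j) = j */

    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int del = E[i - 1][j] + 1;                          /* case 2 */
            int ins = E[i][j - 1] + 1;                          /* case 3 */
            int rep = E[i - 1][j - 1] + (a[i - 1] != b[j - 1]); /* case 1 */
            int best = del < ins ? del : ins;
            E[i][j] = best < rep ? best : rep;
        }
    }
    return E[m][n];
}
```

After `fill_table("SUNDAY", "SATURDAY")`, the cells for the example are E[1][3] = 2, E[1][4] = 3 and E[2][3] = 2. The candidates for E[2][4] are therefore 2 + 0 (the two U's match), 3 + 1 (deletion) and 2 + 1 (insertion), so E[2][4] = 2, and the full distance E[6][8] is 3.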

I recommend going through this lecture for a good explanation.

The function match() returns 1 if the two characters mismatch (so that one more move is added to the final answer), otherwise 0.

Please go through this link: https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/styles/pages/editdistance.html

The code implementing the above algorithm is:

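int min(int a, int b) { return a < b ? a : b; }

int dpEdit(char *s1, char *s2, int len1, int len2)
{
    if (len1 == 0)          // base case: insert every character of s2
        return len2;
    else if (len2 == 0)     // base case: delete every character of s1
        return len1;
    else
    {
        int add, remove, replace;
        // table[i][j] = edit distance between the first i chars of s1
        // and the first j chars of s2
        int table[len1+1][len2+1];
        for (int i = 0; i <= len2; i++)
            table[0][i] = i;
        for (int i = 0; i <= len1; i++)
            table[i][0] = i;
        for (int i = 1; i <= len1; i++)
        {
            for (int j = 1; j <= len2; j++)
            {
                add = table[i][j-1] + 1;        // insert s2[j-1]
                remove = table[i-1][j] + 1;     // delete s1[i-1]
                if (s1[i-1] != s2[j-1])
                    replace = table[i-1][j-1] + 1;
                else
                    replace = table[i-1][j-1];  // characters match, no cost
                table[i][j] = min(min(add, remove), replace);
            }
        }
        return table[len1][len2];
    }
}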

This is a recursive algorithm, not dynamic programming. Note that i and j point to the last characters of s and t respectively when the algorithm starts.
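The practical difference shows up when you count calls: without a table, the same (i, j) pairs are solved over and over. A small sketch that instruments the plain recursion follows. The names `count_calls`, `rec` and `calls` are hypothetical, and for brevity it uses 0-indexed strings with i and j as prefix lengths rather than the book's 1-indexed convention.

```c
#include <string.h>

static long calls;   /* hypothetical counter, reset by count_calls() */

/* Plain recursion with unit costs and no caching. */
static int rec(const char *s, const char *t, int i, int j)
{
    calls++;
    if (i == 0) return j;
    if (j == 0) return i;

    int sub = rec(s, t, i - 1, j - 1) + (s[i - 1] != t[j - 1]);
    int ins = rec(s, t, i, j - 1) + 1;
    int del = rec(s, t, i - 1, j) + 1;

    int best = sub < ins ? sub : ins;
    return best < del ? best : del;
}

/* Stores the edit distance in *distance and returns the number of calls made. */
long count_calls(const char *s, const char *t, int *distance)
{
    calls = 0;
    *distance = rec(s, t, (int)strlen(s), (int)strlen(t));
    return calls;
}
```

For two strings of length 1 this makes 4 calls, and for two strings of length 2 it already makes 19, even though there are only 9 distinct (i, j) pairs. The count depends only on the lengths, not on the characters.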

indel returns 1. match(a, b) returns 0 if a = b (match), else 1 (substitution).

#define MATCH     0       /* enumerated type symbol for match */
#define INSERT    1       /* enumerated type symbol for insert */
#define DELETE    2       /* enumerated type symbol for delete */

int string_compare(char *s, char *t, int i, int j)
{
    int k;                  /* counter */
    int opt[3];             /* cost of the three options */
    int lowest_cost;        /* lowest cost */

    // Base case: if i is 0 we have reached the start of s, so what is
    // left of s is empty. Turning an empty string into the first j
    // characters of t takes j insertions, so the distance is j * 1.
    if (i == 0) return(j * indel(' '));
    // Same reasoning as above, with the roles of s and t swapped.
    if (j == 0) return(i * indel(' '));

    // opt[MATCH]: compare s[i] with t[j] (cost 0 if equal, 1 if not),
    // then recursively solve the rest, s[1..i-1] against t[1..j-1].
    opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
    // opt[INSERT]: insert t[j] into s to make it look like t or, seen
    // the other way round, delete t[j] from t to make it look like s.
    // Either way t[j] is accounted for, so j decreases by 1, i (the
    // pointer into s) stays put, and we add indel(t[j]) (always 1).
    opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
    // Same reasoning as before, but deleting from s (or inserting into t).
    opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);

    // These lines just pick the minimum of opt[MATCH], opt[INSERT],
    // and opt[DELETE].
    lowest_cost = opt[MATCH];
    for (k=INSERT; k<=DELETE; k++)
            if (opt[k] < lowest_cost) lowest_cost = opt[k];

    return( lowest_cost );
}

The algorithm is not hard to understand; you just need to read it a couple of times. What always amuses me is the person who invented it and the trust that recursion will do the right thing.

This is likely a non-issue for the OP by now, but I'll write down my understanding of the text.

/**
 * Returns the cost of a substitution(match) operation
 */
int match(char c, char d)
{
  if (c == d) return 0;
  else return 1;
}

/**
 * Returns the cost of an insert/delete operation(assumed to be a constant operation)
 */
int indel(char c)
{
  return 1;
}

The edit distance is essentially the minimum number of modifications required to transform a given string into another reference string. The modifications, as you know, can be the following.

  1. Substitution (replacing a single character)
  2. Insert (inserting a single character into the string)
  3. Delete (deleting a single character from the string)

Now,

Properly posing the question of string similarity requires us to set the cost of each of these string transform operations. Assigning each operation an equal cost of 1 defines the edit distance between two strings.

So that establishes that each of the three modifications known to us has a constant cost of 1.

But how do we know where to modify?

Instead of deciding up front, we look at the strings from the end, character by character, for modifications that may or may not be needed. So:

  1. We count all substitution operations, starting from the end of the string
  2. We count all delete operations, starting from the end of the string
  3. We count all insert operations, starting from the end of the string

Finally, once we have this data, we return the minimum of the above three sums.
