
longest common subsequence: why is this wrong?

int lcs(char * A, char * B)
{
  int m = strlen(A);
  int n = strlen(B);
  int *X = malloc(m * sizeof(int));
  int *Y = malloc(n * sizeof(int));
  int i;
  int j;
  for (i = m; i >= 0; i--)
    {
      for (j = n; j >= 0; j--)
        {
          if (A[i] == '\0' || B[j] == '\0') 
              X[j] = 0;
          else if (A[i] == B[j]) 
              X[j] = 1 + Y[j+1];
          else 
              X[j] = max(Y[j], X[j+1]);
        }
      Y = X;
    }
  return X[0];
}

This works, but valgrind complains loudly about invalid reads. How was I messing up the memory? Sorry, I always fail at C memory allocation.

The issue here is with the size of your table. Note that you're allocating space as

int *X = malloc(m * sizeof(int));
int *Y = malloc(n * sizeof(int));

However, you are using indices 0 ... m and 0 ... n, which means that X needs m + 1 slots and Y needs n + 1 slots.

Try changing this to read

int *X = malloc((m + 1) * sizeof(int));
int *Y = malloc((n + 1) * sizeof(int));

Hope this helps!

Series of issues. First, as templatetypedef says, you're under-allocated.

Then, as paddy says, you're not freeing your malloc'd memory. If you need the Y=X line, you'll need to store the original malloc'd addresses in another set of variables so you can call free on them.

...mallocs...
int * original_y = Y;
int * original_x = X;
...body of code...
int result = X[0];   // read the answer before the memory is released
free(original_y);
free(original_x);
return result;
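
To make the allocation and free advice concrete, here is a minimal sketch of the two-row approach from the question with those fixes applied. It is one possible cleanup, not the asker's code: the name lcs_two_rows, the pointer swap in place of Y = X, and the -1 return on allocation failure are all assumptions of mine. It also assumes both rows are indexed by j, as in the question's inner loop, so each row gets n + 1 slots.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the longest common subsequence of A and B,
 * or -1 if allocation fails. */
int lcs_two_rows(const char *A, const char *B)
{
    size_t m = strlen(A);
    size_t n = strlen(B);

    /* Both rows are indexed by j in 0..n, so each needs n + 1 slots.
     * calloc zero-fills them, which matches the "end of string" base case. */
    int *original_x = calloc(n + 1, sizeof(int));
    int *original_y = calloc(n + 1, sizeof(int));
    if (original_x == NULL || original_y == NULL) {
        free(original_x);
        free(original_y);
        return -1;
    }

    int *X = original_x;  /* row being filled, for the suffix A[i..]   */
    int *Y = original_y;  /* previous row, for the suffix A[i+1..]     */

    for (size_t i = m; i-- > 0; ) {
        X[n] = 0;  /* B[n] is the terminator, so nothing can match */
        for (size_t j = n; j-- > 0; ) {
            if (A[i] == B[j])
                X[j] = 1 + Y[j + 1];
            else
                X[j] = Y[j] > X[j + 1] ? Y[j] : X[j + 1];
        }
        /* Swap the rows instead of aliasing them with Y = X. */
        int *tmp = X;
        X = Y;
        Y = tmp;
    }

    int result = Y[0];  /* after the last swap, Y is the most recently filled row */
    free(original_x);
    free(original_y);
    return result;
}
```

With the swap, neither block is ever lost, and the result is read before either free call.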

But this doesn't address your new question: why doesn't the code actually work?

I admit I can't follow your code (without a lot more study), but I can propose an algorithm that will work and be far more understandable. This may be somewhat pseudocode and not particularly efficient, but getting it correct is the first step. I've listed some optimizations later.

int lcs(char * A, char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);


  // length of the longest common substring found so far
  int longest_found_length = 0;

  // go through each substring of one of the strings (doesn't matter which, you could pick the shorter one if you want)
  char * candidate_substring = malloc(sizeof(char) * length_a + 1);
  for (int start_position = 0; start_position < length_a; start_position++) {
    for (int end_position = start_position; end_position < length_a; end_position++) {

       int substring_length = end_position - start_position + 1;

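       // make a null-terminated copy of the substring to look for in the other string
       strncpy(candidate_substring, &(A[start_position]), substring_length);
       candidate_substring[substring_length] = '\0'; // strncpy does not add the terminator here
       // only record a match that is longer than the best one so far
       if (substring_length > longest_found_length && strstr(B, candidate_substring) != NULL) {
         longest_found_length = substring_length;
       }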
    }

  }
  free(candidate_substring);
  return longest_found_length;
}

Some different optimizations you could do:

       // if this can't be longer, then don't bother checking it.  You can play games with the for loop to not have this happen, but it's more complicated.
       if (substring_length <= longest_found_length) {
         continue;
       }

and

       // there are more optimizations you could do to this, but don't check
       //   the substring if it's longer than b, since b can't contain it.
       if (substring_length > length_b) {
         continue;
       } 

and

   if (strstr(B, candidate_substring) != NULL) {
     longest_found_length = end_position - start_position + 1;
   } else {
     // if nothing contains the shorter string, then nothing can contain the longer one, so skip checking longer strings with the same starting character
     break; // skip out of inner loop to next iteration of start_position
   }

Instead of copying each candidate substring to a new string, you could temporarily swap the character at end_position + 1 with a NUL character. Then, after looking for that substring in B, swap the original character at end_position + 1 back in. This would be much faster, but complicates the implementation a little.
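
The NUL-swap idea above can be sketched roughly as follows. This is only an illustration under my own assumptions: the function name longest_common_substring_swap is made up, and the swap is done on a single mutable copy of A rather than on A itself, since the caller may have passed a string literal that must not be written to. It also folds in the early break from the previous snippet.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the longest common substring of A and B,
 * or -1 if allocation fails. */
int longest_common_substring_swap(const char *A, const char *B)
{
    size_t length_a = strlen(A);

    /* One mutable copy of A, made once, instead of one copy per candidate. */
    char *work = malloc(length_a + 1);
    if (work == NULL)
        return -1;
    memcpy(work, A, length_a + 1);

    int longest = 0;
    for (size_t start = 0; start < length_a; start++) {
        for (size_t end = start; end < length_a; end++) {
            /* Temporarily terminate the candidate right after end. */
            char saved = work[end + 1];
            work[end + 1] = '\0';
            int found = (strstr(B, &work[start]) != NULL);
            work[end + 1] = saved;  /* swap the original character back in */

            if (!found)
                break;  /* a longer candidate from this start cannot match either */

            int length = (int)(end - start + 1);
            if (length > longest)
                longest = length;
        }
    }

    free(work);
    return longest;
}
```

When end is the last index, work[end + 1] is already the terminator, so the swap is a harmless no-op there.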

NOTE: I don't normally write two answers, and if you feel that it is tacky, feel free to comment on this one and not vote it up. This answer is a more optimized solution, but I wanted to give the most straightforward one I could think of first and then put this in another answer so as not to confuse the two. Basically they are for different audiences.

The key to solving this problem efficiently is to not throw away information you have about shorter common substrings when looking for longer ones. Naively, you check each substring against the other string, but if you know that "AB" matches in "ABC", and your next character is C, don't check to see if "ABC" is in "ABC", just check that the spot after "AB" is a "C".

For each character in A, you have to check up to all the letters in B, but because we stop looking through B once a longer substring is no longer possible, it greatly limits the number of checks. Each time you get a longer match up front, you eliminate checks on the back end, because anything starting there can no longer be a longer substring.

For example, if A and B are both long, but contain no common letters, each letter in A will be compared against each letter in B for a runtime of A*B.

For a sequence where there are a lot of matches, but the match length isn't a large fraction of the length of the shorter string, you have A * B combinations to check against the shorter of the two strings (A or B), leading to either A*B*A or A*B*B, which is basically O(n^3) time for similar length strings. I really thought the optimizations in this solution would do better than n^3 even though there are triple-nested for loops, but as best I can tell they do not.

I'm thinking about this some more, though. Either the substrings being found are NOT a significant fraction of the length of the strings, in which case the optimizations don't do much, but the comparisons for each combination of A*B don't scale with A or B and drop out as constants; or they are a significant fraction of A and B, and the match length directly divides down the A*B combinations that have to be compared.

I just may ask this in a question.

int lcs(char * A, char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);

  // length of the longest common substring found so far
  int longest_length_found = 0;

  // for each character in one string (doesn't matter which), look for incrementally larger strings in the other
  for (int a_index = 0; a_index < length_a - longest_length_found; a_index++) {
    for (int b_index = 0; b_index < length_b - longest_length_found; b_index++) {

       // offset into each string until end of string or non-matching character is found
      for (int offset = 0; A[a_index+offset] != '\0' && B[b_index+offset] != '\0' && A[a_index+offset] == B[b_index+offset]; offset++) {
        // offset + 1 characters have matched at this point
        longest_length_found = longest_length_found > offset + 1 ? longest_length_found : offset + 1;
      }
    }
  }
  return longest_length_found;
}
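
On the runtime question raised above: a different, standard technique (not the method in this answer) gets the longest common substring in time proportional to A*B by remembering, for every pair of positions, how long the common run ending there is. The sketch below is only that, a sketch; the function name longest_common_substring_dp and the -1 return on allocation failure are assumptions of mine.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the longest common substring of A and B,
 * or -1 if allocation fails. */
int longest_common_substring_dp(const char *A, const char *B)
{
    size_t length_a = strlen(A);
    size_t length_b = strlen(B);

    /* prev[j + 1] holds the length of the common run ending at A[i - 1], B[j];
     * curr[j + 1] is the same for A[i], B[j]. Index 0 stays 0 in both rows. */
    int *prev = calloc(length_b + 1, sizeof(int));
    int *curr = calloc(length_b + 1, sizeof(int));
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }

    int longest = 0;
    for (size_t i = 0; i < length_a; i++) {
        for (size_t j = 0; j < length_b; j++) {
            if (A[i] == B[j]) {
                /* extend the run that ended one character earlier in both strings */
                curr[j + 1] = prev[j] + 1;
                if (curr[j + 1] > longest)
                    longest = curr[j + 1];
            } else {
                curr[j + 1] = 0;
            }
        }
        /* the row just filled becomes the previous row for the next i */
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    free(prev);
    free(curr);
    return longest;
}
```

Each pair (i, j) is visited exactly once, so the work is A*B regardless of how long the matches are, at the cost of two rows of extra memory.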

In addition to what templatetypedef said, some things to think about:

  • Why aren't X and Y the same size?
  • Why are you doing Y = X? That's an assignment of pointers. Did you perhaps mean memcpy(Y, X, (n+1)*sizeof(int))?
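
To see the difference between the two, here is a small illustrative sketch. The function names and the array size of 4 are made up for the demo. The first function shows that after Y = X both names refer to the same block, so a later write through X is visible through Y; the second shows that memcpy leaves Y with its own copy.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical demo: returns what Y[0] holds after X[0] is overwritten,
 * when Y was "updated" by pointer assignment. Returns -1 on allocation failure. */
int y0_after_pointer_assignment(void)
{
    int *X = calloc(4, sizeof(int));
    int *Y = calloc(4, sizeof(int));
    if (X == NULL || Y == NULL) {
        free(X);
        free(Y);
        return -1;
    }
    int *original_y = Y;  /* keep the address so the block can still be freed */

    X[0] = 7;
    Y = X;         /* no data is copied; Y now points at X's block */
    X[0] = 99;     /* this write is visible through Y as well */
    int seen = Y[0];

    free(X);
    free(original_y);
    return seen;
}

/* Hypothetical demo: same steps, but Y is updated with memcpy. */
int y0_after_memcpy(void)
{
    int *X = calloc(4, sizeof(int));
    int *Y = calloc(4, sizeof(int));
    if (X == NULL || Y == NULL) {
        free(X);
        free(Y);
        return -1;
    }

    X[0] = 7;
    memcpy(Y, X, 4 * sizeof(int));  /* Y gets its own copy of the values */
    X[0] = 99;                      /* does not affect Y */
    int seen = Y[0];

    free(X);
    free(Y);
    return seen;
}
```

In the question's loop, the pointer assignment means that from the second row onward the "previous row" and the "current row" are the same memory.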
