Algorithms: Interesting diffing algorithm

This came up in a real-world situation, and I thought I would share it, as it could lead to some interesting solutions. Essentially, the algorithm needs to diff two lists, but let me give you a more rigorous definition of the problem.

Mathematical Formulation

Suppose you have two lists, L and R, each of which contains elements from some underlying alphabet S. Moreover, these lists have the property that the common elements they share appear in the same order: that is to say, if L[i] = R[i*] and L[j] = R[j*], and i < j, then i* < j*. The lists need not have any common elements at all, and one or both may be empty. [Clarification: You may assume no repetitions of elements.]

The problem is to produce a sort of "diff" of the lists, which may be viewed as a new list of ordered pairs (x,y), where x is from L and y is from R, with the following properties:

  1. If x appears in both lists, then (x,x) appears in the result.
  2. If x appears in L, but not in R, then (x,NULL) appears in the result.
  3. If y appears in R, but not in L, then (NULL,y) appears in the result.

and finally

  • The result list has "the same" ordering as each of the input lists: it shares, roughly speaking, the same ordering property as above with each of the lists individually (see example).

Examples

L = (d)
R = (a,b,c)
Result = ((d,NULL), (NULL,a), (NULL,b), (NULL,c))

L = (a,b,c,d,e)  
R = (b,q,c,d,g,e)
Result = ((a,NULL), (b,b), (NULL,q), (c,c), (d,d), (NULL,g), (e,e))
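The properties above can be turned into a small checker, which is handy for testing any candidate answer. This is only a sketch: the name is_valid_diff is made up here, None stands in for NULL, and it assumes None never occurs as a list element.

```python
def is_valid_diff(left, right, result):
    """Check a candidate diff against the three pairing rules and the ordering rule."""
    common = set(left) & set(right)
    for x, y in result:
        if x is not None and y is not None:
            # Rule 1: a full pair must be (x, x) for a shared element.
            if x != y or x not in common:
                return False
        elif x is not None:
            # Rule 2: (x, None) only for elements missing from the right list.
            if x in common:
                return False
        elif y is not None:
            # Rule 3: (None, y) only for elements missing from the left list.
            if y in common:
                return False
        else:
            return False  # (None, None) carries no information
    # Ordering rule: dropping the None entries from each column must give
    # back the corresponding input list, element for element.
    xs = [x for x, _ in result if x is not None]
    ys = [y for _, y in result if y is not None]
    return xs == list(left) and ys == list(right)
```

For the second example, is_valid_diff("abcde", "bqcdge", ...) accepts the result listed there with NULL written as None.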

Does anyone have any good algorithms to solve this? What is the complexity?

There is a way to do this in O(n), if you're willing to make a copy of one of the lists in a different data structure. This is a classic time/space tradeoff.

Create a hash map of the list R, with the key being the element and the value being the original index into the array; in C++, you could use unordered_map from tr1 or boost (std::unordered_map since C++11).

Keep an index to the unprocessed portion of list R, initialized to the first element.

For each element in list L, check the hash map for a match in list R. If you do not find one, output (L value, NULL). If there is a match, get the corresponding index from the hash map. For each unprocessed element in list R up to the matching index, output (NULL, R value). For the match, output (value, value).

When you have reached the end of list L, go through the remaining elements of list R and output (NULL, R value).

Edit: Here is the solution in Python. To those who say this solution depends on the existence of a good hashing function: of course it does. The original poster may add additional constraints to the question if this is a problem, but I will take an optimistic stance until then.

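def FindMatches(listL, listR):
    result = []
    # Map each element of listR to its index for constant-time expected lookups.
    lookupR = {}
    for i in range(0, len(listR)):
        lookupR[listR[i]] = i
    unprocessedR = 0  # index of the first element of listR not yet output
    for left in listL:
        if left in lookupR:
            # Everything in listR before the match is absent from listL.
            for right in listR[unprocessedR:lookupR[left]]:
                result.append((None, right))
            result.append((left, left))
            unprocessedR = lookupR[left] + 1
        else:
            result.append((left, None))
    # Whatever remains in listR has no counterpart in listL.
    for right in listR[unprocessedR:]:
        result.append((None, right))
    return result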

>>> FindMatches(('d',),('a','b','c'))
[('d', None), (None, 'a'), (None, 'b'), (None, 'c')]
>>> FindMatches(('a','b','c','d','e'),('b','q','c','d','g','e'))
[('a', None), ('b', 'b'), (None, 'q'), ('c', 'c'), ('d', 'd'), (None, 'g'), ('e', 'e')]

The worst case, as defined and using only equality, must be O(n*m). Consider the following two lists:

A[] = {a,b,c,d,e,f,g}

B[] = {h,i,j,k,l,m,n}

Assume there exists exactly one match between those two "ordered" lists. It will take O(n*m) comparisons, since no single comparison removes the need for other comparisons later.

So, any algorithm you come up with is going to be O(n*m), or worse.
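To make the counting concrete, here is a rough sketch that tests pairs with == only and reports how many tests it needed. The helper name count_equality_tests is invented for illustration; it shows the cost of a plain pairwise scan and is not a proof of the lower bound.

```python
def count_equality_tests(a, b):
    """Scan pairs with == until the first match; return (match, number_of_tests)."""
    tests = 0
    for x in a:
        for y in b:
            tests += 1
            if x == y:
                return (x, tests)
    # No pair matched, so every one of the len(a) * len(b) pairs was tested.
    return (None, tests)


print(count_equality_tests("abcdefg", "hijklmn"))  # (None, 49)
```

With the two seven-element lists above, which share nothing, the scan performs all 7 * 7 = 49 tests before it can report that there is no match.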

Diffing ordered lists can be done in linear time by traversing both lists and matching as you go. I will try to post some pseudo Java code in an update.

Since we don't know the ordering algorithm and can't determine any ordering based on less-than or greater-than operators, we must consider the lists unordered. Also, given how the results are to be formatted, you are faced with scanning both lists (at least until you find a match, and then you can bookmark and start from there again). It will still be O(n^2) performance, or, more specifically, O(nm).

This is exactly like sequence alignment; you can use the Needleman-Wunsch algorithm to solve it. The link includes the code in Python. Just make sure you set the scoring so that a mismatch is negative, a match is positive, and an alignment with a blank is 0 when maximizing. The algorithm runs in O(n * m) time and space, but the space complexity of this can be improved.

Scoring Function

int score(char x, char y){
    if ((x == ' ') || (y == ' ')){
        return 0;
    }
    else if (x != y){
        return -1;
    }
    else if (x == y){
        return 1;
    }
    else{
        puts("Error!");
        exit(2);
    }
}

Code

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* Largest of three ints; every path returns a value. */
int max(int a, int b, int c){
    int m = a;
    if (b > m){
        m = b;
    }
    if (c > m){
        m = c;
    }
    return m;
}

int score(char x, char y){
    if ((x == ' ') || (y == ' ')){
        return 0;
    }
    else if (x != y){
        return -1;
    }
    else if (x == y){
        return 1;
    }
    else{
        puts("Error!");
        exit(2);
    }
}


void print_table(int **table, char str1[], char str2[]){
    unsigned int i, j, len1, len2;
    len1 = strlen(str1) + 1;
    len2 = strlen(str2) + 1;
    for (j = 0; j < len2; j++){
        if (j != 0){
            printf("%3c", str2[j - 1]);
        }
        else{
            printf("%3c%3c", ' ', ' ');
        }
    }
    putchar('\n');
    for (i = 0; i < len1; i++){
        if (i != 0){
            printf("%3c", str1[i - 1]);
        }
        else{
            printf("%3c", ' ');
        }
        for (j = 0; j < len2; j++){
            printf("%3d", table[i][j]);
        }
        putchar('\n');
    }
}

int **optimal_global_alignment_table(char str1[], char str2[]){
    unsigned int len1, len2, i, j;
    int **table;
    len1 = strlen(str1) + 1;
    len2 = strlen(str2) + 1;
    table = malloc(sizeof(int*) * len1);
    for (i = 0; i < len1; i++){
        table[i] = calloc(len2, sizeof(int));
    }
    /* A gap scores 0, so the first column and the first row stay at 0. */
    for (i = 0; i < len1; i++){
        table[i][0] += i * score(str1[i], ' ');
    }
    for (j = 0; j < len2; j++){
        table[0][j] += j * score(str2[j], ' ');
    }
    for (i = 1; i < len1; i++){
        for (j = 1; j < len2; j++){
            table[i][j] = max(
                table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]),
                table[i - 1][j] + score(str1[i - 1], ' '),
                table[i][j - 1] + score(' ', str2[j - 1])
            );
        }
    }
    return table;
}

void prefix_char(char ch, char str[]){
    int i;
    for (i = strlen(str); i >= 0; i--){
        str[i+1] = str[i];
    }   
    str[0] = ch;
}

void optimal_global_alignment(int **table, char str1[], char str2[]){
    unsigned int i, j;
    char *align1, *align2;
    i = strlen(str1);
    j = strlen(str2);
    /* An alignment has at most i + j columns, plus the terminating NUL. */
    align1 = malloc(sizeof(char) * (i + j + 1));
    align2 = malloc(sizeof(char) * (i + j + 1));
    align1[0] = align2[0] = '\0';
    while((i > 0) && (j > 0)){
        if (table[i][j] == (table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]))){
            prefix_char(str1[i - 1], align1);
            prefix_char(str2[j - 1], align2);
            i--;
            j--;
        }
        else if (table[i][j] == (table[i - 1][j] + score(str1[i-1], ' '))){
            prefix_char(str1[i - 1], align1);
            prefix_char('_', align2);
            i--;
        }
        else if (table[i][j] == (table[i][j - 1] + score(' ', str2[j - 1]))){
            prefix_char('_', align1);
            prefix_char(str2[j - 1], align2);
            j--;
        }
    }
    while (i > 0){
        prefix_char(str1[i - 1], align1);
        prefix_char('_', align2);
        i--;
    }
    while(j > 0){
        prefix_char('_', align1);
        prefix_char(str2[j - 1], align2);
        j--;
    }
    puts(align1);
    puts(align2);
}

int main(int argc, char * argv[]){
    int **table;
    if (argc == 3){
        table = optimal_global_alignment_table(argv[1], argv[2]);
        print_table(table, argv[1], argv[2]);
        optimal_global_alignment(table, argv[1], argv[2]);
    }
    else{
        puts("Requires two string arguments!");
    }
    return 0;
}

Sample IO

$ cc dynamic_programming.c && ./a.out aab bba
__aab
bb_a_
$ cc dynamic_programming.c && ./a.out d abc
___d
abc_
$ cc dynamic_programming.c && ./a.out abcde bqcdge
ab_cd_e
_bqcdge
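For comparison with the question's output format, here is a rough Python sketch of the same idea that returns (x, y) pairs with None in place of the '_' blank. It assumes the scoring described above (match +1, mismatch -1, blank 0); the function name align is made up for this sketch. Because two blanks (0) always beat one mismatch (-1), the traceback never has to pair two different elements.

```python
def align(left, right):
    """Global alignment (match +1, mismatch -1, blank 0) as a list of pairs."""
    n, m = len(left), len(right)
    # table[i][j] is the best score for aligning left[:i] with right[:j].
    # Blanks cost 0, so the first row and the first column stay at 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = table[i - 1][j - 1] + (1 if left[i - 1] == right[j - 1] else -1)
            table[i][j] = max(diag, table[i - 1][j], table[i][j - 1])
    # Trace back from the bottom-right corner, collecting pairs in reverse.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if left[i - 1] == right[j - 1]:
            pairs.append((left[i - 1], right[j - 1]))
            i -= 1
            j -= 1
        elif table[i][j] == table[i - 1][j]:
            pairs.append((left[i - 1], None))
            i -= 1
        else:
            pairs.append((None, right[j - 1]))
            j -= 1
    # One of the inputs may still have a prefix left over.
    while i > 0:
        pairs.append((left[i - 1], None))
        i -= 1
    while j > 0:
        pairs.append((None, right[j - 1]))
        j -= 1
    pairs.reverse()
    return pairs


print(align("abcde", "bqcdge"))
```

Note that where neither list constrains the order (for example "d" against "abc"), this sketch may place the unmatched elements in a different, but still valid, order than other answers.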

This is a pretty simple problem since you already have an ordered list.

//this is very rough pseudocode
stack aList;
stack bList;
List resultList;
char aVal;
char bVal;

while(aList.Count > 0 || bList.Count > 0)
{
  aVal = aList.Count > 0 ? aList.Peek() : null; //grab the top item in A, if any
  bVal = bList.Count > 0 ? bList.Peek() : null; //grab the top item in B, if any

  if(bVal == null || (aVal != null && aVal < bVal))
  {
     resultList.Add(new Tuple(aList.Pop(), null));
  }
  else if(aVal == null || bVal < aVal)
  {
     resultList.Add(new Tuple(null, bList.Pop()));
  }
  else //equal
  {
     resultList.Add(new Tuple(aList.Pop(), bList.Pop()));
  }
}

Note... this code WILL NOT compile. It is just meant as a guide.
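For anyone who wants something that actually runs, here is a minimal Python sketch of the same merge idea. It only applies under a strong assumption: both lists must be sorted ascending under a < operator that is available to the algorithm. The name merge_diff is made up for this sketch.

```python
def merge_diff(left, right):
    """Diff two lists that are both sorted ascending under <."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            # left[i] is smaller than everything remaining in right.
            result.append((left[i], None))
            i += 1
        elif right[j] < left[i]:
            # right[j] is smaller than everything remaining in left.
            result.append((None, right[j]))
            j += 1
        else:
            result.append((left[i], right[j]))
            i += 1
            j += 1
    # At most one of the two lists still has elements left.
    result.extend((x, None) for x in left[i:])
    result.extend((None, y) for y in right[j:])
    return result


print(merge_diff([1, 2, 3, 5], [2, 4, 5, 6]))
# [(1, None), (2, 2), (3, None), (None, 4), (5, 5), (None, 6)]
```

Each step advances at least one index, so the loop runs in O(n + m) when the < operator exists. The question's second example, R = (b,q,c,d,g,e), is not sorted alphabetically, so this sketch does not apply to it as given.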

EDIT: Based on the OP's comments

If the ordering algorithm is not exposed, then the lists must be considered unordered. If the lists are unordered, then the algorithm has a time complexity of O(n^2), specifically O(nm) where n and m are the number of items in each list.

EDIT: Algorithm to solve this

L(a,b,c,d,e) R(b,q,c,d,g,e)

//pseudo code... will not compile
//Note, this modifies aList and bList, so make copies.
List aList;
List bList;
List resultList;
var aVal;
var bVal;

while(aList.Count > 0)
{
   aVal = aList.Pop();
   for(int bIndex = 0; bIndex < bList.Count; bIndex++)
   {
      bVal = bList[bIndex];
      if(aVal.RelevantlyEquivalentTo(bVal))
      {
         //The bList items that come BEFORE the match are definitely not in aList
         for(int tempIndex = 0; tempIndex < bIndex; tempIndex++)
         {
             resultList.Add(new Tuple(null, bList.Pop()));
         }
         //This 'popped' item is the same as bVal right now
         resultList.Add(new Tuple(aVal, bList.Pop()));

         //Set aVal to null so it doesn't get added to resultList again
         aVal = null;

         //Break because it's guaranteed not to be in the rest of the list
         break;
      }
   }
   //No Matches
   if(aVal != null)
   {
      resultList.Add(new Tuple(aVal, null));
   }
}
//aList is now empty, and all the items left in bList need to be added to result set
while(bList.Count > 0)
{
   resultList.Add(new Tuple(null, bList.Pop()));
}

The result set will be

L(a,b,c,d,e) R(b,q,c,d,g,e)

Result ((a,null),(b,b),(null,q),(c,c),(d,d),(null,g),(e,e))
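The same scan can be sketched in runnable Python. The name scan_diff and its equivalent parameter are made up for this sketch; equivalent plays the role of RelevantlyEquivalentTo and defaults to ==. Unlike the pseudocode, it leaves both input lists untouched and tracks a start index instead of popping.

```python
def scan_diff(left, right, equivalent=lambda a, b: a == b):
    """Diff two lists using only an equivalence test, in O(n*m) comparisons."""
    result = []
    start = 0  # index of the first element of right not yet emitted
    for x in left:
        match = None
        # Look for x among the elements of right that are still unprocessed.
        for k in range(start, len(right)):
            if equivalent(x, right[k]):
                match = k
                break
        if match is None:
            result.append((x, None))
        else:
            # Elements of right before the match cannot be in left.
            result.extend((None, y) for y in right[start:match])
            result.append((x, right[match]))
            start = match + 1
    # left is exhausted; whatever remains in right is unmatched.
    result.extend((None, y) for y in right[start:])
    return result


print(scan_diff("abcde", "bqcdge"))
```

For the two lists above, this prints the same pairs as the result set, with None in place of null.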

No real tangible answer, only vague intuition. Because you don't know the ordering algorithm, only that the data is ordered in each list, it sounds vaguely like the algorithms used to "diff" files (e.g. in Beyond Compare) and match sequences of lines together. Or also vaguely similar to regexp algorithms.

There can also be multiple solutions. (Never mind: not if there are no repeated elements and they are strictly ordered. I was thinking too much along the lines of file comparisons.)

I don't think you have enough information. All you've asserted is that elements that match do so in the same order, but finding the first matching pair is an O(nm) operation unless you have some other ordering that you can determine.

SELECT DISTINCT l.element, r.element
FROM LeftList l
FULL OUTER JOIN RightList r
ON l.element = r.element
ORDER BY l.id, r.id

Assumes the ID of each element is its ordering. And of course, that your lists are contained in a relational database :)
