简体   繁体   中英

Algorithms: Interesting diffing algorithm

This came up in a real-world situation, and I thought I would share it, as it could lead to some interesting solutions. Essentially, the algorithm needs to diff two lists, but let me give you a more rigorous definition of the problem.

Mathematical Formulation

Suppose you have two lists, L and R each of which contain elements from some underlying alphabet S . Moreover, these lists have the property that the common elements that they have appear in order: that is to say, if L[i] = R[i*] and L[j] = R[j*] , and i < j then i * < j *. The lists need not have any common elements at all, and one or both may be empty. [ Clarification: You may assume no repetitions of elements. ]

The problem is to produce a sort of "diff" of the lists, which may be viewed as new list of ordered pairs (x,y) where x is from L and y is from R , with the following properties:

  1. If x appears in both lists, then (x,x) appears in the result.
  2. If x appears in L , but not in R , then (x,NULL) appears in the result.
  3. If y appears in R , but not in L , then (NULL,y) appears in the result.

and finally

  • The result list has "the same" ordering as each of the input lists: it shares, roughly speaking, the same ordering property as above with each of the lists individually (see example).

Examples

L = (d)
R = (a,b,c)
Result = ((NULL,d), (a,NULL), (b,NULL), (c,NULL))

L = (a,b,c,d,e)  
R = (b,q,c,d,g,e)
Result = ((a,NULL), (b,b), (NULL,q), (c,c), (d,d), (NULL,g), (e,e))

Does anyone have any good algorithms to solve this? What is the complexity?

There is a way to do this in O(n), if you're willing to make a copy of one of the lists in a different data structure. This is a classic time/space tradeoff.

Create a hash map of the list R, with the key being the element and the value being the original index into the array; in C++, you could use unordered_map from tr1 or boost.

Keep an index to the unprocessed portion of list R, initialized to the first element.

For each element in list L, check the hash map for a match in list R. If you do not find one, output (L value, NULL). If there is a match, get the corresponding index from the hash map. For each unprocessed element in list R up to the matching index, output (NULL, R value). For the match, output (value, value).

When you have reached the end of list L, go through the remaining elements of list R and output (NULL, R value).

Here is the solution in Python. 这是Python的解决方案。 To those who say this solution depends on the existence of a good hashing function - of course it does. The original poster may add additional constraints to the question if this is a problem, but I will take an optimistic stance until then.

def FindMatches(listL, listR):
    result=[]
    lookupR={}
    for i in range(0, len(listR)):
        lookupR[listR[i]] = i
    unprocessedR = 0
    for left in listL:
        if left in lookupR:
            for right in listR[unprocessedR:lookupR[left]]:
                result.append((None,right))
            result.append((left,left))
            unprocessedR = lookupR[left] + 1
        else:
            result.append((left,None))
    for right in listR[unprocessedR:]:
        result.append((None,right))
    return result

>>> FindMatches(('d'),('a','b','c'))
[('d', None), (None, 'a'), (None, 'b'), (None, 'c')]
>>> FindMatches(('a','b','c','d','e'),('b','q','c','d','g','e'))
[('a', None), ('b', 'b'), (None, 'q'), ('c', 'c'), ('d', 'd'), (None, 'g'), ('e','e')]

The worst case, as defined and using only equality, must be O(n*m). Consider the following two lists:

A[] = {a,b,c,d,e,f,g}

B[] = {h,i,j,k,l,m,n}

Assume there exists exactly one match between those two "ordered" lists. It will take O(n*m) comparisons since there does not exist a comparison which removes the need for other comparisons later.

So, any algorithm you come up with is going to be O(n*m), or worse.

Diffing ordered lists can be done in linear time by traversing both lists and matching as you go. I will try to post some psuedo Java code in an update.

Since we don't know the ordering algorithm and can't determine any ordering based on less than or greater than operators, we must consider the lists unordered. Also, given how the results are to be formatted you are faced with scanning both lists (at least until you find a match and then you can bookmark and start from there again). It will still be O(n^2) performance, or yes more specifically O(nm).

This is exactly like sequence alignment, you can use the Needleman-Wunsch algorithm to solve it. The link includes the code in Python. Just make sure you set the scoring so that a mismatch is negative and a match is positive and an alignment with a blank is 0 when maximizing. The algorithm runs in O(n * m) time and space, but the space complexity of this can be improved.

Scoring Function

int score(char x, char y){
    if ((x == ' ') || (y == ' ')){
        return 0;
    }
    else if (x != y){
        return -1;
    }
    else if (x == y){
        return 1;
    }
    else{
        puts("Error!");
        exit(2);
    }
}

Code

#include <stdio.h>
#include <stdbool.h>

int max(int a, int b, int c){
    bool ab, ac, bc;
    ab = (a > b);
    ac = (a > c);
    bc = (b > c);
    if (ab && ac){
        return a;
    }
    if (!ab && bc){
        return b;
    }
    if (!ac && !bc){
        return c;
    }
}

int score(char x, char y){
    if ((x == ' ') || (y == ' ')){
        return 0;
    }
    else if (x != y){
        return -1;
    }
    else if (x == y){
        return 1;
    }
    else{
        puts("Error!");
        exit(2);
    }
}


void print_table(int **table, char str1[], char str2[]){
    unsigned int i, j, len1, len2;
    len1 = strlen(str1) + 1;
    len2 = strlen(str2) + 1;
    for (j = 0; j < len2; j++){
        if (j != 0){
            printf("%3c", str2[j - 1]);
        }
        else{
            printf("%3c%3c", ' ', ' ');
        }
    }
    putchar('\n');
    for (i = 0; i < len1; i++){
        if (i != 0){
            printf("%3c", str1[i - 1]);
        }
        else{
            printf("%3c", ' ');
        }
        for (j = 0; j < len2; j++){
            printf("%3d", table[i][j]);
        }
        putchar('\n');
    }
}

int **optimal_global_alignment_table(char str1[], char str2[]){
    unsigned int len1, len2, i, j;
    int **table;
    len1 = strlen(str1) + 1;
    len2 = strlen(str2) + 1;
    table = malloc(sizeof(int*) * len1);
    for (i = 0; i < len1; i++){
        table[i] = calloc(len2, sizeof(int));
    }
    for (i = 0; i < len1; i++){
        table[i][0] += i * score(str1[i], ' ');
    }
    for (j = 0; j < len1; j++){
        table[0][j] += j * score(str1[j], ' ');
    }
    for (i = 1; i < len1; i++){
        for (j = 1; j < len2; j++){
            table[i][j] = max(
                table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]),
                table[i - 1][j] + score(str1[i - 1], ' '),
                table[i][j - 1] + score(' ', str2[j - 1])
            );
        }
    }
    return table;
}

void prefix_char(char ch, char str[]){
    int i;
    for (i = strlen(str); i >= 0; i--){
        str[i+1] = str[i];
    }   
    str[0] = ch;
}

void optimal_global_alignment(int **table, char str1[], char str2[]){
    unsigned int i, j;
    char *align1, *align2;
    i = strlen(str1);
    j = strlen(str2);
    align1 = malloc(sizeof(char) * (i * j));
    align2 = malloc(sizeof(char) * (i * j));
    align1[0] = align2[0] = '\0';
    while((i > 0) && (j > 0)){
        if (table[i][j] == (table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]))){
            prefix_char(str1[i - 1], align1);
            prefix_char(str2[j - 1], align2);
            i--;
            j--;
        }
        else if (table[i][j] == (table[i - 1][j] + score(str1[i-1], ' '))){
            prefix_char(str1[i - 1], align1);
            prefix_char('_', align2);
            i--;
        }
        else if (table[i][j] == (table[i][j - 1] + score(' ', str2[j - 1]))){
            prefix_char('_', align1);
            prefix_char(str2[j - 1], align2);
            j--;
        }
    }
    while (i > 0){
        prefix_char(str1[i - 1], align1);
        prefix_char('_', align2);
        i--;
    }
    while(j > 0){
        prefix_char('_', align1);
        prefix_char(str2[j - 1], align2);
        j--;
    }
    puts(align1);
    puts(align2);
}

int main(int argc, char * argv[]){
    int **table;
    if (argc == 3){
        table = optimal_global_alignment_table(argv[1], argv[2]);
        print_table(table, argv[1], argv[2]);
        optimal_global_alignment(table, argv[1], argv[2]);
    }
    else{
        puts("Reqires to string arguments!");
    }
    return 0;
}

Sample IO

$ cc dynamic_programming.c && ./a.out aab bba
__aab
bb_a_
$ cc dynamic_programming.c && ./a.out d abc
___d
abc_
$ cc dynamic_programming.c && ./a.out abcde bqcdge
ab_cd_e
_bqcdge

This is a pretty simple problem since you already have an ordered list.

//this is very rough pseudocode
stack aList;
stack bList;
List resultList;
char aVal;
char bVal;

while(aList.Count > 0 || bList.Count > 0)
{
  aVal = aList.Peek; //grab the top item in A
  bVal = bList.Peek; //grab the top item in B

  if(aVal < bVal || bVal == null)
  {
     resultList.Add(new Tuple(aList.Pop(), null)));
  }
  if(bVal < aVal || aVal == null)
  {
     resultList.Add(new Tuple(null, bList.Pop()));
  }
  else //equal
  {
     resultList.Add(new Tuple(aList.Pop(), bList.Pop()));
  }
}

Note... this code WILL NOT compile. It is just meant as a guide.

EDIT Based on the OP comments

If the ordering algorithm is not exposed, then the lists must be considered unordered. If the lists are unordered, then the algorithm has a time complexity of O(n^2), specifically O(nm) where n and m are the number of items in each list.

EDIT Algorithm to solve this

L(a,b,c,d,e) R(b,q,c,d,g,e)

//pseudo code... will not compile
//Note, this modifies aList and bList, so make copies.
List aList;
List bList;
List resultList;
var aVal;
var bVal;

while(aList.Count > 0)
{
   aVal = aList.Pop();
   for(int bIndex = 0; bIndex < bList.Count; bIndex++)
   {
      bVal = bList.Peek();
      if(aVal.RelevantlyEquivalentTo(bVal)
      {
         //The bList items that come BEFORE the match, are definetly not in aList
         for(int tempIndex = 0; tempIndex < bIndex; tempIndex++)
         {
             resultList.Add(new Tuple(null, bList.Pop()));
         }
         //This 'popped' item is the same as bVal right now
         resultList.Add(new Tuple(aVal, bList.Pop()));

         //Set aVal to null so it doesn't get added to resultList again
         aVal = null;

         //Break because it's guaranteed not to be in the rest of the list
         break;
      }
   }
   //No Matches
   if(aVal != null)
   {
      resultList.Add(new Tuple(aVal, null));
   }
}
//aList is now empty, and all the items left in bList need to be added to result set
while(bList.Count > 0)
{
   resultList.Add(new Tuple(null, bList.Pop()));
}

The result set will be

L(a,b,c,d,e) R(b,q,c,d,g,e)

Result ((a,null),(b,b),(null,q),(c,c),(d,d),(null,g),(e,e))

No real tangible answer, only vague intuition. Because you don't know the ordering algorithm, only that the data is ordered in each list, it sounds vaguely like the algorithms used to "diff" files (eg in Beyond Compare) and match sequences of lines together. Or also vaguely similar to regexp algorithms.

There can also be multiple solutions. (never mind, not if there are not repeated elements that are strictly ordered. I was thinking too much along the lines of file comparisons)

I don't think you have enough information. All you've asserted is that elements that match match in the same order, but finding the first matching pair is an O(nm) operation unless you have some other ordering that you can determine.

SELECT distinct l.element, r.element
FROM LeftList l
OUTER JOIN RightList r
ON l.element = r.element
ORDER BY l.id, r.id

Assumes the ID of each element is its ordering. And of course, that your lists are contained in a Relational Database :)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM