Levenshtein distance algorithm

I have a couple of questions. The code that I use is given below.

  1. How does it compare the strings? Does it only compare the characters from left to right?

  2. Why do the first two items have a lower match percentage (am I right to say that?) than the third item, even though more of their characters match?

Pay to Co 123 vs Supliersss 123
output: 0.286

Pay to Co 456 vs C 456 Pte Ltd
output: 0.077

Co 879 vs 87
output: 0.500

    public static double similarity(String s1, String s2) {
        // Make sure "longer" really is the longer of the two strings.
        String longer = s1, shorter = s2;
        if (s1.length() < s2.length()) {
            longer = s2;
            shorter = s1;
        }
        int longerLength = longer.length();
        if (longerLength == 0) {
            return 1.0; // both strings are zero length
        }
        /* // If you have StringUtils, you can use it to calculate the edit distance:
        return (longerLength - StringUtils.getLevenshteinDistance(longer, shorter)) /
                (double) longerLength; */
        return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
    }

    // Example implementation of the Levenshtein Edit Distance
    // See http://rosettacode.org/wiki/Levenshtein_distance#Java
    public static int editDistance(String s1, String s2) {
        // The comparison is case-insensitive.
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();

        // One row of the dynamic-programming table, reused for every i.
        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            int lastValue = i; // cell to the left in the current row
            for (int j = 0; j <= s2.length(); j++) {
                if (i == 0) {
                    costs[j] = j; // first row: j insertions
                } else if (j > 0) {
                    int newValue = costs[j - 1]; // diagonal cell (previous row)
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        // cheapest of substitution, insertion and deletion, plus one edit
                        newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
            if (i > 0) {
                costs[s2.length()] = lastValue;
            }
        }
        return costs[s2.length()];
    }

    public static void printSimilarity(String s, String t) {
        System.out.println(String.format(
                "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
    }

Thanks!

How does it compare the strings? Does it only compare the characters from left to right?

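No, it does not simply compare the strings from left to right. A purely left-to-right, position-by-position comparison would be a naive approach, because it ignores that inserting or deleting a character (rather than substituting one) can lead to fewer edits. Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.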
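To make the difference concrete, here is a minimal sketch (not the code from the question) that puts a full-table Levenshtein computation next to a naive positional comparison. The class name `EditDistanceSketch` and the helper `positionalMismatches` are made up for this illustration, and unlike the question's code it does not lowercase its inputs.

```java
final class EditDistanceSketch {
    // Full-table Levenshtein: d[i][j] is the number of edits needed to turn
    // the first i characters of a into the first j characters of b.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete i characters
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert j characters
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitute = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int delete = d[i - 1][j] + 1;
                int insert = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[a.length()][b.length()];
    }

    // Hypothetical naive approach: walk both strings left to right, count the
    // positions that differ, and add the difference in length.
    static int positionalMismatches(String a, String b) {
        int common = Math.min(a.length(), b.length());
        int mismatches = Math.abs(a.length() - b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) != b.charAt(i)) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // A single insertion at the front shifts every later character.
        System.out.println(levenshtein("abc", "xabc"));          // 1
        System.out.println(positionalMismatches("abc", "xabc")); // 4
    }
}
```

The positional count is thrown off by one inserted character, while the table considers the insertion directly and finds the single edit.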

I suggest you read up on how edit distance is defined.

Why do the first two items have a lower match percentage (am I right to say that?) than the third item, even though more of their characters match?

The number of matching characters is not the criterion that Levenshtein distance measures. Let me explain with an example.

Take two strings of length 10:
abckkkkkkk and rrrrrrrabc

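One way of converting one into the other, preserving the order of all matching characters (abc), is to insert 7 r's and delete 7 k's, which requires 14 edits.

Another way is to substitute every character of the first string with the corresponding character of the second, which requires only 10 edits. Levenshtein distance is the minimum, so the distance here is 10, even though that route throws away the abc match.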
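The two routes can be checked with a short sketch. It assumes that "preserving all matching characters" means allowing only insertions and deletions, which costs len(a) + len(b) - 2 * LCS(a, b), where LCS is the longest common subsequence. The class and method names are invented for this example.

```java
final class IndelVersusLevenshtein {
    // Levenshtein distance: insertion, deletion and substitution each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j], cur[j - 1]) + 1);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Length of the longest common subsequence of a and b.
    static int lcs(String a, String b) {
        int[][] t = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                t[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? t[i - 1][j - 1] + 1
                        : Math.max(t[i - 1][j], t[i][j - 1]);
            }
        }
        return t[a.length()][b.length()];
    }

    // Cost when only insertions and deletions are allowed, so the longest
    // run of in-order matching characters is kept.
    static int insertDeleteOnly(String a, String b) {
        return a.length() + b.length() - 2 * lcs(a, b);
    }

    public static void main(String[] args) {
        String a = "abckkkkkkk", b = "rrrrrrrabc";
        System.out.println(insertDeleteOnly(a, b)); // 14: keep abc, insert 7 r's, delete 7 k's
        System.out.println(levenshtein(a, b));      // 10: substitute every position
    }
}
```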

In the first case (your example), it is best to retain " 123" (4 characters, counting the space) at the end and insert/substitute everything before it. So the distance is 10 and the longer string is 14 characters long, which gives (14 - 10)/14 = 4/14 = 0.286.

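In the second case, 5 characters (C 456) do match in order, but keeping them would cost 16 edits: delete "Pay to " and the "o", then insert " Pte Ltd". It is cheaper to edit almost the whole string: substituting position by position leaves only a single character unchanged (the space in the tenth position of both strings), so the distance is 12 and the output is (13 - 12)/13 = 1/13 = 0.077.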

Similarly the third.
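The arithmetic for the first two cases can be reproduced with a condensed sketch of the question's approach (case-insensitive distance, divided by the longer length). `SimilarityCheck` and `report` are names invented for this example, and the two-row distance function is a stand-in for the single-array version in the question.

```java
import java.util.Locale;

final class SimilarityCheck {
    // Case-insensitive Levenshtein distance, computed with two rows.
    static int editDistance(String s1, String s2) {
        String a = s1.toLowerCase(Locale.ROOT);
        String b = s2.toLowerCase(Locale.ROOT);
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j], cur[j - 1]) + 1);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // similarity = (longerLength - distance) / longerLength, as in the question.
    static double similarity(String s1, String s2) {
        int longerLength = Math.max(s1.length(), s2.length());
        if (longerLength == 0) return 1.0;
        return (longerLength - editDistance(s1, s2)) / (double) longerLength;
    }

    // Formats the intermediate numbers so the division is visible.
    static String report(String s1, String s2) {
        return String.format(Locale.ROOT, "distance=%d, longer=%d, similarity=%.3f",
                editDistance(s1, s2), Math.max(s1.length(), s2.length()), similarity(s1, s2));
    }

    public static void main(String[] args) {
        System.out.println(report("Pay to Co 123", "Supliersss 123")); // distance=10, longer=14, similarity=0.286
        System.out.println(report("Pay to Co 456", "C 456 Pte Ltd"));  // distance=12, longer=13, similarity=0.077
    }
}
```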
