I have a couple of questions. The code I use is given below.
How does it compare the strings? Does it only compare the characters from left to right?
Why do the first two items have a lower match percentage (am I right to say that?) than the third item, even though more characters match in them than in the third item?
Pay to Co 123 vs Supliersss 123
output: 0.286
Pay to Co 456 vs C 456 Pte Ltd
output: 0.077
Co 879 vs 87
output: 0.500
public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
        longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have StringUtils, you can use it to calculate the edit distance:
    return (longerLength - StringUtils.getLevenshteinDistance(longer, shorter)) /
        (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}

// Example implementation of the Levenshtein Edit Distance
// See http://rosettacode.org/wiki/Levenshtein_distance#Java
public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
                costs[j] = j;
            else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1))
                        newValue = Math.min(Math.min(newValue, lastValue),
                            costs[j]) + 1;
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0)
            costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
}

public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
        "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
}
Thanks!
How does it compare the strings? Does it only compare the characters from left to right?
No, it does not simply compare the strings position by position from left to right. That would be a naive approach, because it ignores the fact that characters can also be inserted or deleted (not only substituted) to reach the minimum number of edits.
I suggest you read an explanation of edit distance (Levenshtein distance) for the details.
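To make the difference concrete, here is a minimal sketch that contrasts a naive position-by-position comparison with a textbook Levenshtein distance. The class and method names (`PositionalVsEditDistance`, `positionalMismatches`, `levenshtein`) are made up for this illustration and are not part of the code in the question.

```java
class PositionalVsEditDistance {

    // Naive left-to-right comparison: counts positions where the characters
    // differ, plus the extra characters of the longer string.
    public static int positionalMismatches(String a, String b) {
        int common = Math.min(a.length(), b.length());
        int mismatches = Math.abs(a.length() - b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // Levenshtein distance using a full (a.length()+1) x (b.length()+1) table.
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete i characters
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert j characters
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int same = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = d[i - 1][j - 1] + same;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitution, Math.min(deletion, insertion));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Every position differs once "x" shifts the digits by one place.
        System.out.println(positionalMismatches("123", "x123")); // 4
        // A single insertion of "x" is enough.
        System.out.println(levenshtein("123", "x123"));          // 1
    }
}
```

The positional count treats the shifted digits as four mismatches, while the edit distance sees one insertion.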
Why do the first two items have a lower match percentage (am I right to say that?) than the third item, even though more characters match in them than in the third item?
The number of matching characters is not what Levenshtein distance measures. It measures the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. Let me explain with an example:
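Take two strings of length 10: abckkkkkkk and rrrrrrrabc.
One way of converting the first into the second, preserving the order of all matching characters, is to insert 7 r's at the start and delete the 7 k's at the end, which requires 14 edits.
Another way is to substitute every character of the first string with the corresponding character of the second, which requires only 10 edits. The Levenshtein distance is the minimum over all such edit sequences, so the cheaper route wins even though it keeps no matching characters.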
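The two routes from this example can be costed with a short sketch. The names (`EditRouteCosts`, `lcsLength`, `insertDeleteOnlyCost`, `substituteAllCost`) are hypothetical helpers for illustration only. The sketch assumes that "preserving all matching characters" means keeping a longest common subsequence and paying one edit for every other character.

```java
class EditRouteCosts {

    // Length of the longest common subsequence: the most characters that
    // can be kept, in order, when turning a into b.
    public static int lcsLength(String a, String b) {
        int[][] l = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    l[i][j] = l[i - 1][j - 1] + 1;
                } else {
                    l[i][j] = Math.max(l[i - 1][j], l[i][j - 1]);
                }
            }
        }
        return l[a.length()][b.length()];
    }

    // Route 1: keep every matching character, then delete what is left
    // over in a and insert what is missing from b.
    public static int insertDeleteOnlyCost(String a, String b) {
        return a.length() + b.length() - 2 * lcsLength(a, b);
    }

    // Route 2: substitute position by position (equal-length strings only).
    public static int substituteAllCost(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int cost = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                cost++;
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        String a = "abckkkkkkk";
        String b = "rrrrrrrabc";
        System.out.println(insertDeleteOnlyCost(a, b)); // 14
        System.out.println(substituteAllCost(a, b));    // 10
    }
}
```

Keeping "abc" costs 14 edits, while throwing it away and substituting everything costs 10, so the Levenshtein distance cannot be more than 10.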
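In the first case (your example), it is best to retain " 123" (4 characters, counting the leading space) at the end and insert/substitute everything before it. The distance is therefore 10, and the longer string is 14 characters long, so the output is (14 - 10) / 14 = 4/14 = 0.286.
In the second case, the two strings share the 5 characters "C 456" in order, but they sit at opposite ends, so lining them up would cost more edits than it saves. It is cheaper to keep a single character in place (for example, the space at position 10 of both strings) and substitute the rest, giving a distance of 12 and an output of (13 - 12) / 13 = 1/13 = 0.077.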
The third case follows the same reasoning.
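The arithmetic for the first two cases can be checked with a small sketch. It uses a two-row Levenshtein implementation instead of the single-array version in the question, and the class name `SimilarityCheck` and the `describe` helper are invented for this example. Like the question's code, it lower-cases both strings first.

```java
import java.util.Locale;

class SimilarityCheck {

    // Two-row Levenshtein distance on lower-cased input, mirroring the
    // case-insensitive behaviour of editDistance in the question.
    public static int editDistance(String s1, String s2) {
        String a = s1.toLowerCase();
        String b = s2.toLowerCase();
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(previous[j - 1] + cost,
                        Math.min(previous[j], current[j - 1]) + 1);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    // Same formula as in the question: (longerLength - distance) / longerLength.
    public static double similarity(String s1, String s2) {
        int longerLength = Math.max(s1.length(), s2.length());
        if (longerLength == 0) {
            return 1.0;
        }
        return (longerLength - editDistance(s1, s2)) / (double) longerLength;
    }

    public static String describe(String s1, String s2) {
        return String.format(Locale.ROOT, "distance=%d, longer=%d, similarity=%.3f",
                editDistance(s1, s2), Math.max(s1.length(), s2.length()), similarity(s1, s2));
    }

    public static void main(String[] args) {
        // distance=10, longer=14, similarity=0.286
        System.out.println(describe("Pay to Co 123", "Supliersss 123"));
        // distance=12, longer=13, similarity=0.077
        System.out.println(describe("Pay to Co 456", "C 456 Pte Ltd"));
    }
}
```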