[英]Similarity String Comparison in Java
我想比较几个字符串,并找到最相似的字符串。 我想知道是否有任何库、方法或最佳实践可以返回哪些字符串与其他字符串更相似。 例如:
这种比较将返回第一个比第二个更相似。
我想我需要一些方法,例如:
double similarityIndex(String s1, String s2)
有没有这样的地方?
编辑:我为什么要这样做? 我正在编写一个脚本,将 MS 项目文件的 output 与一些处理任务的遗留系统的 output 进行比较。 因为遗留系统的字段宽度非常有限,所以在添加值时,描述会被缩写。 我想要一些半自动的方法来查找 MS Project 中的哪些条目与系统上的条目相似,这样我就可以获得生成的密钥。 它有缺点,因为它仍然必须手动检查,但它会节省很多工作
以 0%-100% 的方式计算两个字符串之间相似度的常用方法,如许多库中所使用的,是测量您必须将较长的字符串更改为较短的字符串的程度(以 % 为单位):
/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
editDistance()
: 上面的editDistance()
函数预计会计算两个字符串之间的编辑距离。 此步骤有多种实现方式,每种实现方式可能更适合特定场景。 最常见的是Levenshtein 距离算法,我们将在下面的示例中使用它(对于非常大的字符串,其他算法可能会表现得更好)。
以下是计算编辑距离的两个选项:
apply(CharSequence left, CharSequence rightt)
public class StringSimilarity {
/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
/* // If you have Apache Commons Text, you can use it to calculate the edit distance:
LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// Example implementation of the Levenshtein Edit Distance
// See http://rosettacode.org/wiki/Levenshtein_distance#Java
public static int editDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0)
costs[j] = j;
else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1))
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0)
costs[s2.length()] = lastValue;
}
return costs[s2.length()];
}
public static void printSimilarity(String s, String t) {
System.out.println(String.format(
"%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
}
public static void main(String[] args) {
printSimilarity("", "");
printSimilarity("1234567890", "1");
printSimilarity("1234567890", "123");
printSimilarity("1234567890", "1234567");
printSimilarity("1234567890", "1234567890");
printSimilarity("1234567890", "1234567980");
printSimilarity("47/2010", "472010");
printSimilarity("47/2010", "472011");
printSimilarity("47/2010", "AB.CDEF");
printSimilarity("47/2010", "4B.CDEFG");
printSimilarity("47/2010", "AB.CDEFG");
printSimilarity("The quick fox jumped", "The fox jumped");
printSimilarity("The quick fox jumped", "The fox");
printSimilarity("kitten", "sitting");
}
}
输出:
1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
我将Levenshtein 距离算法翻译成 JavaScript:
String.prototype.LevenshteinDistance = function (s2) {
var array = new Array(this.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i] = new Array(s2.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i][0] = i;
for (var j = 0; j < s2.length + 1; j++)
array[0][j] = j;
for (var i = 1; i < this.length + 1; i++) {
for (var j = 1; j < s2.length + 1; j++) {
if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
else {
array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
}
}
}
return array[this.length][s2.length];
};
确实有很多字符串相似性度量:
你可以在这里找到这些的解释和 java 实现: https : //github.com/tdebatty/java-string-similarity
您可以使用 Levenshtein 距离来计算两个字符串之间的差异。 http://en.wikipedia.org/wiki/Levenshtein_distance
您可以使用apache commons java 库来实现这一点。 看看里面的这两个函数:
- 获取LevenshteinDistance
- 获取模糊距离
感谢第一个回答者,我认为computeEditDistance(s1, s2)有2次计算。 由于花费了大量时间,决定提高代码的性能。 所以:
public class LevenshteinDistance {
public static int computeEditDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0) {
costs[j] = j;
} else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
}
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0) {
costs[s2.length()] = lastValue;
}
}
return costs[s2.length()];
}
public static void printDistance(String s1, String s2) {
double similarityOfStrings = 0.0;
int editDistance = 0;
if (s1.length() < s2.length()) { // s1 should always be bigger
String swap = s1;
s1 = s2;
s2 = swap;
}
int bigLen = s1.length();
editDistance = computeEditDistance(s1, s2);
if (bigLen == 0) {
similarityOfStrings = 1.0; /* both strings are zero length */
} else {
similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
}
//////////////////////////
//System.out.println(s1 + "-->" + s2 + ": " +
// editDistance + " (" + similarityOfStrings + ")");
System.out.println(editDistance + " (" + similarityOfStrings + ")");
}
public static void main(String[] args) {
printDistance("", "");
printDistance("1234567890", "1");
printDistance("1234567890", "12");
printDistance("1234567890", "123");
printDistance("1234567890", "1234");
printDistance("1234567890", "12345");
printDistance("1234567890", "123456");
printDistance("1234567890", "1234567");
printDistance("1234567890", "12345678");
printDistance("1234567890", "123456789");
printDistance("1234567890", "1234567890");
printDistance("1234567890", "1234567980");
printDistance("47/2010", "472010");
printDistance("47/2010", "472011");
printDistance("47/2010", "AB.CDEF");
printDistance("47/2010", "4B.CDEFG");
printDistance("47/2010", "AB.CDEFG");
printDistance("The quick fox jumped", "The fox jumped");
printDistance("The quick fox jumped", "The fox");
printDistance("The quick fox jumped",
"The quick fox jumped off the balcany");
printDistance("kitten", "sitting");
printDistance("rosettacode", "raisethysword");
printDistance(new StringBuilder("rosettacode").reverse().toString(),
new StringBuilder("raisethysword").reverse().toString());
for (int i = 1; i < args.length; i += 2) {
printDistance(args[i - 1], args[i]);
}
}
}
理论上,您可以比较编辑距离。
如果您的字符串变成文档,对我来说听起来像是 抄袭发现者。 也许用这个词搜索会发现一些好东西。
“Programming Collective Intelligence”有一章是关于确定两个文档是否相似。 代码是用 Python 编写的,但它干净且易于移植。
您还可以使用 z 算法来查找字符串中的相似性。 点击这里https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/
您可以在没有任何库的情况下使用此“Levenshtein Distance”算法:
public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
if (s == null || t == null) {throw new IllegalArgumentException("Strings must not be null");}
int n = s.length();
int m = t.length();
if (n == 0) {
return m;
}
else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
final int[] p = new int[n + 1];
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
int upper_left;
int upper;
char t_j; // jth character of t
int cost;
for (i = 0; i <= n; i++) {
p[i] = i;
}
for (j = 1; j <= m; j++) {
upper_left = p[0];
t_j = t.charAt(j - 1);
p[0] = j;
for (i = 1; i <= n; i++) {
upper = p[i];
cost = s.charAt(i - 1) == t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
p[i] = Math.min(Math.min(p[i - 1] + 1, p[i] + 1), upper_left + cost);
upper_left = upper;
}
}
return p[n];
}
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.