Problems with Levenshtein algorithm in Java

I want to use the Levenshtein algorithm for the following task: when a user on my website searches for a value (typing characters into an input), I want to check for suggestions instantly with AJAX, like Google Instant does.

I have the impression that the Levenshtein algorithm is way too slow for such a task. To check its behaviour, I first implemented it in Java, printing out the two Strings in every recursive call of the method.

public class Levenshtein {
    public static void main(String[] arg){
        String a = "Hallo Zusammen";
        String b = "jfdss Zusammen";

        int res = levenshtein(a, b);

        System.out.println(res);
    }

    public static int levenshtein(String s, String t){
        int len_s = s.length();
        int len_t = t.length();
        int cost = 0;

        System.out.println("s: " + s + ", t: " + t);

        if(len_s>0 && len_t>0){
            if(s.charAt(0) != t.charAt(0)) cost = 1;
        }

        if(len_s == 0){
            return len_t;
        }else{
            if(len_t == 0){
                return len_s;
            }else{
                String news = s.substring(0, s.length()-1);
                String newt = t.substring(0, t.length()-1);
                return min(levenshtein(news, t) + 1,
                            levenshtein(s, newt) + 1,
                            levenshtein(news, newt) + cost);
            }
        }
    }

    public static int min(int a, int b, int c) {
          return Math.min(Math.min(a, b), c);
    }
}

However, here are some points:

  • I added the check if(len_s>0 && len_t>0) myself, because I was getting a StringIndexOutOfBoundsException with the above test values.
  • With the above test values, the algorithm seems to run forever.

Are there optimizations that can be made to the algorithm to make it work for me, or should I use a completely different one to accomplish the desired task?

1) A few words about improving the Levenshtein distance algorithm

A naive recursive implementation of Levenshtein distance has exponential complexity, because the same pairs of substrings are recomputed over and over.

I'd suggest memoizing the intermediate distances and implementing Levenshtein distance without recursion. That reduces the complexity to O(N^2) time, at the cost of O(N^2) memory:

public static int levenshteinDistance( String s1, String s2 ) {
    return dist( s1.toCharArray(), s2.toCharArray() );
}

public static int dist( char[] s1, char[] s2 ) {

    // distance matrix - to memoize distances between substrings
    // needed to avoid recursion
    int[][] d = new int[ s1.length + 1 ][ s2.length + 1 ];

    // d[i][j] holds the distance between the prefixes
    // s1[0..i) and s2[0..j)

    for( int i = 0; i < s1.length + 1; i++ ) {
        d[ i ][ 0 ] = i;
    }

    for(int j = 0; j < s2.length + 1; j++) {
        d[ 0 ][ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {
        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = d[ i - 1 ][ j ] + 1;
            int d2 = d[ i ][ j - 1 ] + 1;
            int d3 = d[ i - 1 ][ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            d[ i ][ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }
    }
    return d[ s1.length ][ s2.length ];
}

Or, even better: notice that each cell of the distance matrix needs only values from the previous row and the current one, so the memory requirement can be reduced to O(N):

public static int dist( char[] s1, char[] s2 ) {

    // memoize only previous line of distance matrix     
    int[] prev = new int[ s2.length + 1 ];

    for( int j = 0; j < s2.length + 1; j++ ) {
        prev[ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {

        // calculate current line of distance matrix     
        int[] curr = new int[ s2.length + 1 ];
        curr[0] = i;

        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = prev[ j ] + 1;
            int d2 = curr[ j - 1 ] + 1;
            int d3 = prev[ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            curr[ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }

        // define current line of distance matrix as previous     
        prev = curr;
    }
    return prev[ s2.length ];
}
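
For comparison, the recursive formulation from the question can also be kept and memoized top-down. The following is a minimal sketch, not the code from either post: the class name MemoizedLevenshtein and the helper dist are made up for illustration. It works on prefix lengths instead of building substrings, and it compares the last characters of the two prefixes, since those are the characters the recursion strips off.

```java
import java.util.Arrays;

public class MemoizedLevenshtein {

    public static int levenshtein(String s, String t) {
        // memo[i][j] caches the distance between the first i chars of s
        // and the first j chars of t; -1 means "not computed yet"
        int[][] memo = new int[s.length() + 1][t.length() + 1];
        for (int[] row : memo) {
            Arrays.fill(row, -1);
        }
        return dist(s, t, s.length(), t.length(), memo);
    }

    private static int dist(String s, String t, int i, int j, int[][] memo) {
        if (i == 0) return j;            // only insertions left
        if (j == 0) return i;            // only deletions left
        if (memo[i][j] >= 0) return memo[i][j];

        // compare the LAST characters of the two prefixes
        int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
        int result = Math.min(
                Math.min(dist(s, t, i - 1, j, memo) + 1,     // deletion
                         dist(s, t, i, j - 1, memo) + 1),    // insertion
                dist(s, t, i - 1, j - 1, memo) + cost);      // substitution or match
        memo[i][j] = result;
        return result;
    }

    public static void main(String[] args) {
        // the test values from the question
        System.out.println(levenshtein("Hallo Zusammen", "jfdss Zusammen"));
    }
}
```

Each (i, j) pair is computed at most once, so the running time is O(N^2) like the table version; the recursion depth grows with the combined length of the inputs, which is why the iterative versions above are the safer choice for long strings.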

2) A few words about autocomplete

Levenshtein distance is a good fit only when the query is expected to match a whole keyword almost exactly.

But what if your keyword is apple and the user types green apples? The Levenshtein distance between query and keyword would be large (7 points). And the Levenshtein distance between apple and bcdfghk (a dumb string) would be 7 points too!
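
The two figures above can be checked with a small program. This is only a sketch; the class name DistanceCheck and the method distance are hypothetical, and the method is the two-row dynamic-programming variant of Levenshtein distance.

```java
public class DistanceCheck {

    // two-row dynamic-programming Levenshtein distance
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("apple", "green apples")); // relevant query: 7
        System.out.println(distance("apple", "bcdfghk"));      // unrelated string: 7
    }
}
```

Both calls print 7: the relevant query needs seven insertions ("green " plus the trailing "s"), and the unrelated string needs five substitutions plus two insertions. The score alone cannot tell the two cases apart.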

I'd suggest using a full-text search engine (e.g. Lucene). The trick is to use an n-gram model to represent each keyword.

In a few words:
1) Represent each keyword as a document that contains its n-grams: apple -> [ap, pp, pl, le].

2) After transforming each keyword into a set of n-grams, index each keyword-document by n-gram in your search engine. You'll have to create an index like this:

...
ap -> apple, map, happy ...
pp -> apple ...
pl -> apple, place ...
...

3) Now you have an n-gram index. When you get a query, split it into n-grams as well. This gives you the set of query n-grams, and all you need is to retrieve the most similar documents from your search engine. As a first draft, that would be enough.

4) For better suggestions, you may rank the search engine's results by Levenshtein distance.
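
Steps 1 to 3 can be sketched in plain Java with an in-memory map standing in for the search engine. This is an illustration under stated assumptions, not Lucene code: the class BigramSuggester and its methods are hypothetical, n is fixed at 2, and similarity is simply the number of distinct query bigrams a keyword shares. Re-ranking the candidates by Levenshtein distance (step 4) is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BigramSuggester {

    // inverted index: bigram -> keywords that contain it
    private final Map<String, Set<String>> index = new HashMap<>();

    // step 1: split a string into overlapping bigrams, apple -> [ap, pp, pl, le]
    public static List<String> bigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // step 2: index a keyword under each of its bigrams
    public void add(String keyword) {
        for (String gram : bigrams(keyword)) {
            index.computeIfAbsent(gram, g -> new TreeSet<>()).add(keyword);
        }
    }

    // step 3: return keywords ordered by how many distinct query bigrams
    // they share (most first), ties broken alphabetically
    public List<String> suggest(String query) {
        final Map<String, Integer> hits = new HashMap<>();
        for (String gram : new HashSet<>(bigrams(query))) {
            for (String keyword : index.getOrDefault(gram, Collections.<String>emptySet())) {
                hits.merge(keyword, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(hits.keySet());
        ranked.sort(Comparator.comparingInt((String k) -> -hits.get(k))
                .thenComparing(Comparator.<String>naturalOrder()));
        return ranked;
    }

    public static void main(String[] args) {
        BigramSuggester suggester = new BigramSuggester();
        for (String keyword : new String[] {"apple", "map", "happy", "place"}) {
            suggester.add(keyword);
        }
        System.out.println(suggester.suggest("green apples"));
    }
}
```

With these four keywords, the query green apples shares four bigrams with apple, two with happy, and one each with map and place, so apple comes out first even though its Levenshtein distance to the query is 7.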

P.S. I'd suggest looking through the book "Introduction to Information Retrieval".

You can use Apache Commons Lang3's StringUtils.getLevenshteinDistance(). From its Javadoc:

Find the Levenshtein distance between two Strings.

This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).

The previous implementation of the Levenshtein distance algorithm was from http://www.merriampark.com/ld.htm

Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError which can occur when my Java implementation is used with very large strings.

This implementation of the Levenshtein distance algorithm is from http://www.merriampark.com/ldjava.htm

StringUtils.getLevenshteinDistance(null, *)             = IllegalArgumentException
StringUtils.getLevenshteinDistance(*, null)             = IllegalArgumentException
StringUtils.getLevenshteinDistance("","")               = 0
StringUtils.getLevenshteinDistance("","a")              = 1
StringUtils.getLevenshteinDistance("aaapppp", "")       = 7
StringUtils.getLevenshteinDistance("frog", "fog")       = 1
StringUtils.getLevenshteinDistance("fly", "ant")        = 3
StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
StringUtils.getLevenshteinDistance("hello", "hallo")    = 1

There is an open-source library, java-util (https://github.com/jdereg/java-util), that has a StringUtilities.levenshteinDistance(string1, string2) API. It runs in O(N^2) time and uses memory proportional only to O(N) [as discussed above].

This library also includes damerauLevenshteinDistance(). Damerau-Levenshtein counts a character transposition (swap) as one edit, whereas plain Levenshtein counts it as two edits. The downside is that Damerau-Levenshtein, at least in its commonly implemented restricted form (optimal string alignment), does not satisfy the triangle inequality the way the original Levenshtein does.

Great depiction of the triangle inequality:

http://richardminerich.com/2012/09/levenshtein-distance-and-the-triangle-inequality/
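
To make the transposition rule and the triangle-inequality caveat concrete, here is a minimal sketch of the restricted variant (optimal string alignment). It is not java-util's implementation; the class OsaDistance and the method osa are hypothetical names, and the full matrix is kept for readability.

```java
public class OsaDistance {

    // optimal string alignment distance: Levenshtein plus
    // "swap two adjacent characters" as a single edit
    public static int osa(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution or match
                // adjacent transposition: "xy" in a lines up with "yx" in b
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(osa("CA", "AC"));   // 1: one swap (plain Levenshtein gives 2)
        System.out.println(osa("AC", "ABC"));  // 1: one insertion
        System.out.println(osa("CA", "ABC"));  // 3: more than 1 + 1
    }
}
```

Going from CA to ABC directly costs 3, while going through AC costs 1 + 1 = 2, so the direct distance exceeds the sum of the two legs. That is the triangle-inequality violation mentioned above, and it matters if the distance is used in metric-based index structures.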

import java.util.Scanner;

public class Algorithmm {
    public static void main(String args[])
    {
        Scanner sc= new Scanner(System.in);
        System.out.println("Enter the correct string ");
        String correct=sc.nextLine();
        System.out.println("Enter the incorrect string ");
        String incorrect=sc.nextLine();
        int i=correct.length(),j=incorrect.length();
        ++i ; ++j;
        int a[][] = new int[i][j];
        int b[] = new int[3];       
        for(int m=0;m<i;m++)
            for(int n=0;n<j;n++)
            {

                        if(m==0 || n==0)
                        {
                          a[0][n]=n;
                          a[m][0]=m;
                        }
                        else
                        {
                            b[0]=a[m-1][n-1]; b[1]=a[m-1][n]; b[2]=a[m][n-1];


                            if ( correct.charAt(m-1) == incorrect.charAt(n-1)  )
                            {
                                a[m][n]=a[m-1][n-1];
                            }

                            else
                            {
                                for(int t=0;t<2;t++)
                                    for(int u=0;u<2-t;u++)
                                        if(b[u]>b[u+1])
                                            b[u]=b[u+1];


                                a[m][n]=b[0]+1;


                            }

                        }

            }


        for(int m=0;m<i;m++)
        {
            for(int n=0;n<j;n++)
                System.out.print( a[m][n] +"  ");  
            System.out.print("\n");                
        }



        System.out.println(" Levenshtein distance :  "+a[i-1][j-1]);

    }

}

The same program, with the nested loops that find the smallest of the three neighbours replaced by direct comparisons:

import java.util.Scanner;

public class Algorithmm {
    public static void main(String args[])
    {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the correct string ");
        String correct = sc.nextLine();
        System.out.println("Enter the incorrect string ");
        String incorrect = sc.nextLine();
        int i = correct.length(), j = incorrect.length();
        ++i; ++j;
        int a[][] = new int[i][j];
        int b[] = new int[3];
        for (int m = 0; m < i; m++)
            for (int n = 0; n < j; n++)
            {
                if (m == 0 || n == 0)
                {
                    // first row and first column: distance to the empty string
                    a[0][n] = n;
                    a[m][0] = m;
                }
                else
                {
                    b[0] = a[m-1][n-1]; b[1] = a[m-1][n]; b[2] = a[m][n-1];
                    if (correct.charAt(m-1) == incorrect.charAt(n-1))
                        a[m][n] = a[m-1][n-1];
                    else
                    {
                        // Instead of the loops used above to find the smallest
                        // value in 'b', compare the three values directly.
                        if (b[0] <= b[1] && b[0] <= b[2])
                            a[m][n] = b[0] + 1;
                        else if (b[1] <= b[0] && b[1] <= b[2])
                            a[m][n] = b[1] + 1;
                        else
                            a[m][n] = b[2] + 1;
                    }
                }
            }
        // print the full distance matrix
        for (int m = 0; m < i; m++)
        {
            for (int n = 0; n < j; n++)
                System.out.print(a[m][n] + "  ");
            System.out.print("\n");
        }
        System.out.println(" Levenshtein distance :  " + a[i-1][j-1]);
    }
}

You can use the package org.apache.commons.text.similarity, which is better than writing your own Levenshtein.
