简体   繁体   中英

Problems with Levenshtein algorithm in Java

I want to use the Levenshtein algorithm for the following task: if a user on my website searches for some value (he enters characters in a input), I want to instantly check for suggestions with AJAX, like Google Instant does.

I have the impression that the Levenshtein algorithm is way too slow for such a task. To check its behaviour, I first implemented it in Java, printing out the two String s in every recursive call of the method.

public class Levenshtein {
    public static void main(String[] arg){
        String a = "Hallo Zusammen";
        String b = "jfdss Zusammen";

        int res = levenshtein(a, b);

        System.out.println(res);
    }

    public static int levenshtein(String s, String t){
        int len_s = s.length();
        int len_t = t.length();
        int cost = 0;

        System.out.println("s: " + s + ", t: " + t);

        if(len_s>0 && len_t>0){
            if(s.charAt(0) != t.charAt(0)) cost = 1;
        }

        if(len_s == 0){
            return len_t;
        }else{
            if(len_t == 0){
                return len_s;
            }else{
                String news = s.substring(0, s.length()-1);
                String newt = t.substring(0, t.length()-1);
                return min(levenshtein(news, t) + 1,
                            levenshtein(s, newt) + 1,
                            levenshtein(news, newt) + cost);
            }
        }
    }

    public static int min(int a, int b, int c) {
          return Math.min(Math.min(a, b), c);
    }
}

However, here are some points:

  • The check if(len_s>0 && len_t>0) was added by me, because I was getting a StringIndexOutOfBoundsException with above test values.
  • With above test values, the algorithm seems to calculate infinitely

Are there optimizations that can be made on the algorithm to make it work for me, or should I use a completely different one to accomplish the desired task?

1) Few words about Levenshtein distance algorithm improvement

Recursive implementation of Levenshteins distance has exponential complexity .

I'd suggest you to use memoization technique and implement Levenshtein distance without recursion, and reduce complexity to O(N^2) (needs O(N^2) memory)

public static int levenshteinDistance( String s1, String s2 ) {
    return dist( s1.toCharArray(), s2.toCharArray() );
}

public static int dist( char[] s1, char[] s2 ) {

    // distance matrix - to memoize distances between substrings
    // needed to avoid recursion
    int[][] d = new int[ s1.length + 1 ][ s2.length + 1 ];

    // d[i][j] - would contain distance between such substrings:
    // s1.subString(0, i) and s2.subString(0, j)

    for( int i = 0; i < s1.length + 1; i++ ) {
        d[ i ][ 0 ] = i;
    }

    for(int j = 0; j < s2.length + 1; j++) {
        d[ 0 ][ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {
        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = d[ i - 1 ][ j ] + 1;
            int d2 = d[ i ][ j - 1 ] + 1;
            int d3 = d[ i - 1 ][ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            d[ i ][ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }
    }
    return d[ s1.length ][ s2.length ];
}

Or, even better - you may notice, that for each cell in distance matrix - you're need only information about previous line, so you can reduce memory needs to O(N) :

public static int dist( char[] s1, char[] s2 ) {

    // memoize only previous line of distance matrix     
    int[] prev = new int[ s2.length + 1 ];

    for( int j = 0; j < s2.length + 1; j++ ) {
        prev[ j ] = j;
    }

    for( int i = 1; i < s1.length + 1; i++ ) {

        // calculate current line of distance matrix     
        int[] curr = new int[ s2.length + 1 ];
        curr[0] = i;

        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = prev[ j ] + 1;
            int d2 = curr[ j - 1 ] + 1;
            int d3 = prev[ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            curr[ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }

        // define current line of distance matrix as previous     
        prev = curr;
    }
    return prev[ s2.length ];
}

2) Few words about autocomplete

Levenshtein's distance is perferred only if you need to find exact matches.

But what if your keyword would be apple and user typed green apples ? Levenshteins distance between query and keyword would be large ( 7 points ). And Levensteins distance between apple and bcdfghk (dumb string) would be 7 points too!

I'd suggest you to use full-text search engine (eg Lucene ). The trick is - that you have to use n-gram model to represent each keyword.

In few words:
1) you have to represent each keyword as document, which contains n-grams: apple -> [ap, pp, pl, le] .

2) after transforming each keyword to set of n-grams - you have to index each keyword-document by n-gram in your search engine. You'll have to create index like this:

...
ap -> apple, map, happy ...
pp -> apple ...
pl -> apple, place ...
...

3) So you have n-gram index. When you're get query - you have to split it into n-grams . Aftre this - you'll have set of users query n-grams. And all you need - is to match most similar documents from your search engine. In draft approach it would be enough.

4) For better suggest - you may rank results of search-engine by Levenshtein distance.

PS I'd suggest you to look through the book "Introduction to information retrieval" .

You can use Apache Commons Lang3's StringUtils.getLevenshteinDistance() :

Find the Levenshtein distance between two Strings.

This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).

The previous implementation of the Levenshtein distance algorithm was from http://www.merriampark.com/ld.htm

Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError which can occur when my Java implementation is used with very large strings.

This implementation of the Levenshtein distance algorithm is from http://www.merriampark.com/ldjava.htm

 StringUtils.getLevenshteinDistance(null, *) = IllegalArgumentException StringUtils.getLevenshteinDistance(*, null) = IllegalArgumentException StringUtils.getLevenshteinDistance("","") = 0 StringUtils.getLevenshteinDistance("","a") = 1 StringUtils.getLevenshteinDistance("aaapppp", "") = 7 StringUtils.getLevenshteinDistance("frog", "fog") = 1 StringUtils.getLevenshteinDistance("fly", "ant") = 3 StringUtils.getLevenshteinDistance("elephant", "hippo") = 7 StringUtils.getLevenshteinDistance("hippo", "elephant") = 7 StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8 StringUtils.getLevenshteinDistance("hello", "hallo") = 1

There is an open-source library, java-util ( https://github.com/jdereg/java-util ) that has a StringUtilities.levenshteinDistance(string1, string2) API that is implemented in O(N^2) complexity and uses memory only proportional to O(N) [as discussed above].

This library also includes damerauLevenshteinDisance() as well. Damerau-Levenshtein counts the character transposition (swap) as one edit, where as proper levenshtein counts it as two edits. The downside to Damerau-Levenshtein is that it is does not have triangular equality like the original levenshtein.

Great depiction of triangular equality:

http://richardminerich.com/2012/09/levenshtein-distance-and-the-triangle-inequality/

import java.util.Scanner;

public class Algorithmm {
    public static void main(String args[])
    {
        Scanner sc= new Scanner(System.in);
        System.out.println("Enter the correct string ");
        String correct=sc.nextLine();
        System.out.println("Enter the incorrect string ");
        String incorrect=sc.nextLine();
        int i=correct.length(),j=incorrect.length();
        ++i ; ++j;
        int a[][] = new int[i][j];
        int b[] = new int[3];       
        for(int m=0;m<i;m++)
            for(int n=0;n<j;n++)
            {

                        if(m==0 || n==0)
                        {
                          a[0][n]=n;
                          a[m][0]=m;
                        }
                        else
                        {
                            b[0]=a[m-1][n-1]; b[1]=a[m-1][n]; b[2]=a[m][n-1];


                            if ( correct.charAt(m-1) == incorrect.charAt(n-1)  )
                            {
                                a[m][n]=a[m-1][n-1];
                            }

                            else
                            {
                                for(int t=0;t<2;t++)
                                    for(int u=0;u<2-t;u++)
                                        if(b[u]>b[u+1])
                                            b[u]=b[u+1];


                                a[m][n]=b[0]+1;


                            }

                        }

            }


        for(int m=0;m<i;m++)
        {
            for(int n=0;n<j;n++)
                System.out.print( a[m][n] +"  ");  
            System.out.print("\n");                
        }



        System.out.println(" Levenshtein distance :  "+a[i-1][j-1]);

    }

}
public class Algorithmm {
    public static void main(String args[])
    {
        Scanner sc= new Scanner(System.in);
        System.out.println("Enter the correct string ");
        String correct=sc.nextLine();
        System.out.println("Enter the incorrect string ");
        String incorrect=sc.nextLine();
        int i=correct.length(),j=incorrect.length();
        ++i ; ++j;
        int a[][] = new int[i][j];
        int b[] = new int[3];       
        for(int m=0;m<i;m++)
            for(int n=0;n<j;n++)
            {               
                        if(m==0 || n==0)
                        {
                           a[0][n]=n;
                           a[m][0]=m;
                        }
                        else
                        {
                            b[0]=a[m-1][n-1]; b[1]=a[m-1][n]; b[2]=a[m][n-1];    
                            if ( correct.charAt(m-1) == incorrect.charAt(n-1)  )                        
                                a[m][n]=a[m-1][n-1];                                                        
                            else
                            {
                       //instead of using the above code for finding the smallest number in       the array 'b' we can simplyfy that code to the following, so that we can reduce the execution time.//

                                if(  (b[0]<=b[1]) && (b[0])<=b[2]  )
                                    a[m][n]=b[0]+1;
                                else if(  (b[1]<=b[0]) && (b[1])<=b[2]  )
                                    a[m][n]=b[1]+1;
                                else
                                    a[m][n]=b[2]+1;    
                            }                            
                        }                
            }               
        for(int m=0;m<i;m++)
        {
            for(int n=0;n<j;n++)
                System.out.print( a[m][n] +"  ");  
            System.out.print("\n");                
        }       
        System.out.println("
Levenshtein distance :
  "+a[i-1][j-1]);        
    }
}

You can use Package org.apache.commons.text.similarity

which is better than writing your own Levenshtein.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM