
Improving search results using Levenshtein distance in Java

I have the following working Java code for searching for a word against a list of words, and it works perfectly and as expected:

import java.util.HashSet;
import java.util.Set;

public class Levenshtein {
    private int[][] wordMatrix;

    public Set<String> similarExists(String searchWord) {

        int maxDistance = searchWord.length();
        int curDistance;
        int sumCurMax;
        String checkWord;

        // a Set prevents duplicate words in the returned result
        Set<String> fuzzyWordList = new HashSet<>();

        for (Object word : Searcher.wordList) {
            checkWord = String.valueOf(word);
            curDistance = calculateDistance(searchWord, checkWord);
            sumCurMax = maxDistance + curDistance;
            if (sumCurMax == checkWord.length()) {
                fuzzyWordList.add(checkWord);
            }
        }
        return fuzzyWordList;
    }

    public int calculateDistance(String inputWord, String checkWord) {
        wordMatrix = new int[inputWord.length() + 1][checkWord.length() + 1];

        // first row and column: distance from the empty string
        for (int i = 0; i <= inputWord.length(); i++) {
            wordMatrix[i][0] = i;
        }

        for (int j = 0; j <= checkWord.length(); j++) {
            wordMatrix[0][j] = j;
        }

        for (int i = 1; i < wordMatrix.length; i++) {
            for (int j = 1; j < wordMatrix[i].length; j++) {
                if (inputWord.charAt(i - 1) == checkWord.charAt(j - 1)) {
                    wordMatrix[i][j] = wordMatrix[i - 1][j - 1];
                } else {
                    // take the minimum of deletion, insertion and substitution
                    int minimum = Integer.MAX_VALUE;
                    if (wordMatrix[i - 1][j] + 1 < minimum) {
                        minimum = wordMatrix[i - 1][j] + 1;
                    }

                    if (wordMatrix[i][j - 1] + 1 < minimum) {
                        minimum = wordMatrix[i][j - 1] + 1;
                    }

                    if (wordMatrix[i - 1][j - 1] + 1 < minimum) {
                        minimum = wordMatrix[i - 1][j - 1] + 1;
                    }

                    wordMatrix[i][j] = minimum;
                }
            }
        }

        return wordMatrix[inputWord.length()][checkWord.length()];
    }

}

Right now when I search for a word like job it returns a list:

Output

joborienterede
jobannoncer
jobfunktioner
perjacobsen
jakobsen
jobprofiler
jacob
jobtitler
jobbet
jobdatabaserne
jobfunktion
jakob
jobs
studenterjobber
johannesburg
jobmuligheder
jobannoncerne
jobbaser
job
joberfaringer

As you can see, the output has a lot of related words but also unrelated ones like jakob, jacob, etc., which is correct as far as the Levenshtein formula is concerned, but I would like to build further and write a method that can fine-tune my search so I get more relevant and related words.

I have worked on it for a few hours and have run out of ideas.

My question: Is it possible to fine-tune the existing method to return relevant/related words, or should I take another approach? In either case, I would appreciate any input and inspiration regarding improving the search results.


UPDATE

After asking this question a long time ago I have not really found a solution, and I am coming back to it because now is the time when I need a useful answer. It is fine to support the answer with Java code samples, but what is most important is a detailed answer with a description of the available methods and approaches used to index the best and most relevant search results while ignoring irrelevant words. I know this is an open and endless area, but I need some inspiration to start somewhere.

Note: The oldest answer right now is based on one of the comment inputs and is not helpful; it just sorts by distance, which does not mean getting better search results/quality.

So I sorted by distance, and the results looked like this:

job
jobs
jacob
jakob
jobbet
jakobsen
jobbaser
jobtitler
jobannoncer
jobfunktion
jobprofiler
perjacobsen
johannesburg
jobannoncerne
joberfaringer
jobfunktioner
jobmuligheder
jobdatabaserne
joborienterede
studenterjobber

So the word jobbaser is relevant and jacob/jakob are not, but the distance for jobbaser (5) is bigger than for jacob/jakob (2). So that did not really help.


General feedback regarding the answers

  • @SergioMontoro: it solves the problem almost completely.
  • @uSeemSurprised: it solves the problem but needs continual tweaking.
  • @Gene: the concept is excellent, but it relies on an external URL.

Thanks: I would like to personally thank all of you who contributed to this question; I got nice answers and useful comments.

Special thanks for the answers from @SergioMontoro, @uSeemSurprised and @Gene; they are different but all valid and useful.

@D.Kovács points out an interesting solution.

I wish I could give the bounty to all of those answers. Choosing one answer and giving it the bounty does not mean the other answers are not valid; it only means that the particular answer I chose was the most useful for me.

Without understanding the meaning of the words, as @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. It is very easy to modify Levenshtein to compare syllables instead of characters. The hard part is breaking the words into syllables. There is a Java implementation, TeXHyphenator-J, which can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you'll want to avoid syllable comparison of two single-syllable words, probably handling this case with standard Levenshtein.

import net.davidashen.text.Hyphenator;

public class WordDistance {

    public static void main(String args[]) throws Exception {
        Hyphenator h = new Hyphenator();
        h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
        System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
    }

    /**
     * <p>
     * Calculate the syllable Levenshtein distance between two words.</p>
     * The syllable Levenshtein distance is defined as the minimal number of
     * case-insensitive syllables you have to replace, insert or delete to transform one word into the other.
     * @return the syllable Levenshtein distance
     * @throws NullPointerException if either s or t is <b>null</b>
     */
    public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
        if (s == null || t == null)
            throw new NullPointerException("Strings must not be null");

        final String hyphen = Character.toString((char) 173);
        final String[] ss = h.hyphenate(s).split(hyphen);
        final String[] st = h.hyphenate(t).split(hyphen);

        final int n = ss.length;
        final int m = st.length;

        if (n == 0)
            return m;
        else if (m == 0)
            return n;

        int p[] = new int[n + 1]; // 'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally

        for (int i = 0; i <= n; i++)
            p[i] = i;

        for (int j = 1; j <= m; j++) {
            d[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }
            // copy current distance counts to 'previous row' distance counts
            int[] _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
        return p[n];
    }

}
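
As a rough illustration of that last remark, here is a minimal sketch of the dispatch logic, assuming a plain character-level Levenshtein method is also available (for example the calculateDistance from the question); characterLevenshtein is just a placeholder name for it:

// Minimal sketch (assumption): fall back to character-level Levenshtein when either
// word hyphenates into a single syllable, otherwise compare syllables.
public static int wordDistance(Hyphenator h, String s, String t) {
    final String hyphen = Character.toString((char) 173); // soft hyphen inserted by TeXHyphenator-J
    boolean singleSyllable = h.hyphenate(s).split(hyphen).length < 2
            || h.hyphenate(t).split(hyphen).length < 2;
    return singleSyllable
            ? characterLevenshtein(s, t)                  // placeholder for a standard Levenshtein
            : getSyllableLevenshteinDistance(h, s, t);
}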

You can modify Levenshtein distance by adjusting the scoring when consecutive characters match.

Whenever there are consecutive characters that match, the score can be reduced, thus making the search more relevant.

E.g.: Let's say the factor by which we want to reduce the score is 10. Then, if we find the substring "job" in a word, we can reduce the score by 10 when we encounter "j", further reduce it by (10 + 20) when we find the string "jo", and finally reduce the score by (10 + 20 + 30) when we find "job".

I have written the C++ code below:

#include <bits/stdc++.h>

#define INF -10000000
#define FACTOR 10

using namespace std;

double memo[100][100][100];

double Levenshtein(string inputWord, string checkWord, int i, int j, int count){
    if(i == inputWord.length() && j == checkWord.length()) return 0;    
    if(i == inputWord.length()) return checkWord.length() - j;
    if(j == checkWord.length()) return inputWord.length() - i;
    if(memo[i][j][count] != INF) return memo[i][j][count];

    double ans1 = 0, ans2 = 0, ans3 = 0, ans = 0;
    if(inputWord[i] == checkWord[j]){
        ans1 = Levenshtein(inputWord, checkWord, i+1, j+1, count+1) - (FACTOR*(count+1));
        ans2 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans3 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, min(ans2, ans3));
    }else{
        ans1 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans2 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, ans2);
    }
    return memo[i][j][count] = ans;
}

int main(void) {
    string word = "job";
    string wordList[40];
    vector< pair <double, string> > ans;
    for(int i = 0;i < 40;i++){
        cin >> wordList[i];
        for(int j = 0;j < 100;j++) for(int k = 0;k < 100;k++){
            for(int m = 0;m < 100;m++) memo[j][k][m] = INF;
        }
        ans.push_back( make_pair(Levenshtein(word, wordList[i], 
            0, 0, 0), wordList[i]) );
    }
    sort(ans.begin(), ans.end());
    for(int i = 0;i < ans.size();i++){
        cout << ans[i].second << " " << ans[i].first << endl;
    }
    return 0;
}

Link to demo: http://ideone.com/4UtCX3

Here the FACTOR is taken as 10; you can experiment with other words and choose an appropriate value.

Also note that the complexity of the above Levenshtein distance has increased: it is now O(n^3) instead of O(n^2), as we are now also keeping track of a counter that counts how many consecutive matching characters we have encountered.

You can further play with the score by increasing it gradually after you find some consecutive substring and then a mismatch, instead of the current approach where a fixed score of 1 is added to the overall score.

Also, in the above solution you can remove the strings that have a score >= 0, as they are not relevant at all, or you can choose some other threshold to get a more accurate search.
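
To make that filtering step concrete, here is a minimal Java sketch, assuming a score(String, String) method that mirrors the modified Levenshtein above (more negative meaning more relevant); the class name, method names and the threshold of 0 are only placeholders suggested here, not part of the original code:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ScoreFilter {

    // Keep only candidates whose score is below the threshold (i.e. they earned a
    // consecutive-match bonus), then sort so the most relevant come first.
    public static List<String> filterAndRank(String query, List<String> candidates, double threshold) {
        List<String> relevant = new ArrayList<>();
        for (String candidate : candidates) {
            if (score(query, candidate) < threshold) {
                relevant.add(candidate);
            }
        }
        relevant.sort(Comparator.comparingDouble(w -> score(query, w)));
        return relevant;
    }

    // Placeholder: plug in the modified Levenshtein scoring shown in the C++ code above.
    private static double score(String a, String b) {
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}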

Since you asked, I'll show what the UMBC semantic network can do with this kind of thing. Not sure it's what you really want:

import static java.lang.String.format;
import static java.util.Comparator.comparingDouble;
import static java.util.stream.Collectors.toMap;
import static java.util.function.Function.identity;

import java.util.Map.Entry;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.regex.Pattern;

public class SemanticSimilarity {
  private static final String GET_URL_FORMAT
      = "http://swoogle.umbc.edu/SimService/GetSimilarity?"
          + "operation=api&phrase1=%s&phrase2=%s";
  private static final Pattern VALID_WORD_PATTERN = Pattern.compile("\\w+");
  private static final String[] DICT = {
    "cat",
    "building",
    "girl",
    "ranch",
    "drawing",
    "wool",
    "gear",
    "question",
    "information",
    "tank" 
  };

  public static String httpGetLine(String urlToRead) throws IOException {
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      return reader.readLine();
    }
  }

  public static double getSimilarity(String a, String b) {
    if (!VALID_WORD_PATTERN.matcher(a).matches()
        || !VALID_WORD_PATTERN.matcher(b).matches()) {
      throw new RuntimeException("Bad word");
    }
    try {
      return Double.parseDouble(httpGetLine(format(GET_URL_FORMAT, a, b)));
    } catch (IOException | NumberFormatException ex) {
      return -1.0;
    }
  }

  public static void test(String target) throws IOException {
    System.out.println("Target: " + target);
    Arrays.stream(DICT)
        .collect(toMap(identity(), word -> getSimilarity(target, word)))
        .entrySet().stream()
        .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
        .forEach(System.out::println);
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    test("sheep");
    test("vehicle");
    test("house");
    test("data");
    test("girlfriend");
  }
}

The results are kind of fascinating:

Target: sheep
ranch=0.38563728
cat=0.37816614
wool=0.36558008
question=0.047607
girl=0.0388761
information=0.027191084
drawing=0.0039623436
tank=0.0
building=0.0
gear=0.0

Target: vehicle
tank=0.65860236
gear=0.2673374
building=0.20197356
cat=0.06057514
information=0.041832563
ranch=0.017701812
question=0.017145569
girl=0.010708235
wool=0.0
drawing=0.0

Target: house
building=1.0
ranch=0.104496084
tank=0.103863
wool=0.059761923
girl=0.056549154
drawing=0.04310725
cat=0.0418914
gear=0.026439993
information=0.020329408
question=0.0012588014

Target: data
information=0.9924584
question=0.03476312
gear=0.029112043
wool=0.019744944
tank=0.014537057
drawing=0.013742204
ranch=0.0
cat=0.0
girl=0.0
building=0.0

Target: girlfriend
girl=0.70060706
ranch=0.11062875
cat=0.09766617
gear=0.04835723
information=0.02449007
wool=0.0
question=0.0
drawing=0.0
tank=0.0
building=0.0

I tried the suggestion from the comments about sorting the matches by the distance returned by the Levenshtein algorithm, and it seems it does produce better results.

(As I could not find the Searcher class from your code, I took the liberty of using a different word-list source, Levenshtein implementation, and language.)

Using the word list provided with Ubuntu, and the Levenshtein implementation from https://github.com/ztane/python-Levenshtein, I created a small script that asks for a word and prints all the closest words and distances as tuples.

Code: https://gist.github.com/atdaemon/9f59ad886c35024bdd28

from Levenshtein import distance
import os

def read_dict() :
    with open('/usr/share/dict/words','r') as f : 
        for line in f :
            yield str(line).strip()

inp = str(raw_input('Enter a word : '))

wordlist = read_dict()
matches = []
for word in wordlist :
    dist = distance(inp,word)
    if dist < 3 :
        matches.append((dist,word))
print os.linesep.join(map(str,sorted(matches)))

Sample output:

Enter a word : job
(0, 'job')
(1, 'Bob')
(1, 'Job')
(1, 'Rob')
(1, 'bob')
(1, 'cob')
(1, 'fob')
(1, 'gob')
(1, 'hob')
(1, 'jab')
(1, 'jib')
(1, 'jobs')
(1, 'jog')
(1, 'jot')
(1, 'joy')
(1, 'lob')
(1, 'mob')
(1, 'rob')
(1, 'sob')
...

Enter a word : checker
(0, 'checker')
(1, 'checked')
(1, 'checkers')
(2, 'Becker')
(2, 'Decker')
(2, 'cheaper')
(2, 'cheater')
(2, 'check')
(2, "check's")
(2, "checker's")
(2, 'checkered')
(2, 'checks')
(2, 'checkup')
(2, 'cheeked')
(2, 'cheekier')
(2, 'cheer')
(2, 'chewer')
(2, 'chewier')
(2, 'chicer')
(2, 'chicken')
(2, 'chocked')
(2, 'choker')
(2, 'chucked')
(2, 'cracker')
(2, 'hacker')
(2, 'heckler')
(2, 'shocker')
(2, 'thicker')
(2, 'wrecker')

This really is an open-ended question, but I would suggest an alternative approach which uses, for example, the Smith-Waterman algorithm as described in this SO answer.

Another (more lightweight) solution would be to use other distance/similarity metrics from NLP (e.g., cosine similarity or the Damerau-Levenshtein distance).
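
For instance, a character-bigram cosine similarity is easy to sketch in Java. This is only an illustration of the metric mentioned above; the bigram choice and the class name are my own, not a drop-in replacement for the code in the question:

import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Count character bigrams, e.g. "job" -> {"jo", "ob"}.
    private static Map<String, Integer> bigrams(String word) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            counts.merge(word.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of the two bigram count vectors: 1.0 = identical, 0.0 = no overlap.
    public static double similarity(String a, String b) {
        Map<String, Integer> va = bigrams(a.toLowerCase());
        Map<String, Integer> vb = bigrams(b.toLowerCase());
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int c : vb.values()) {
            normB += c * c;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(similarity("job", "jobbaser")); // shares both "jo" and "ob"
        System.out.println(similarity("job", "jacob"));    // shares only "ob"
    }
}

With this metric, for the query job, jobbaser scores higher than jacob, which is closer to the relevance ordering the question asks for.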
