Calculating Levenshtein distance between two strings

I'm executing the following Postgres query:

SELECT *  FROM description WHERE levenshtein(desci, 'Description text?') <= 6  LIMIT 10;

I'm using the following code to execute the above query:

public static boolean authQuestion(String question) throws SQLException {
    boolean isDescAvailable = false;
    Connection connection = null;
    try {
        connection = DbRes.getConnection();
        String query = "SELECT * FROM description WHERE levenshtein(desci, ?) <= 6";
        // Prepare the statement on the connection obtained above
        PreparedStatement checkStmt = connection.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        // Any returned row means a close enough description exists
        while (rs.next()) {
            isDescAvailable = true;
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close the connection exactly once, whatever happened above
        if (connection != null)
            connection.close();
    }
    return isDescAvailable;
}

I want to find the edit distance between the input text and the values that exist in the database, and I want to fetch all rows whose edit distance corresponds to 60 percent. The above query doesn't work as expected. How do I get the rows that have 60 percent similarity?

The most general version of the levenshtein function is:

levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int

Both source and target can be any non-null string, with a maximum of 255 characters. The cost parameters specify how much to charge for a character insertion, deletion, or substitution, respectively. You can omit the cost parameters, as in the second version of the function; in that case they all default to 1.

So, with the default cost parameters, the result you get is the total number of characters you need to change (by insertion, deletion, or substitution) in the source to get the target.
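To make the role of the three cost parameters concrete, here is a minimal Java sketch of the standard dynamic-programming edit distance with the same parameter order as the signature above. It is only an illustration, not Postgres's actual implementation: the class and method names are made up for this example, and it compares Java char values (UTF-16 code units) rather than database characters.

```java
final class LevenshteinSketch {
    /** Weighted edit distance needed to turn source into target (illustrative only). */
    public static int levenshtein(String source, String target,
                                  int insCost, int delCost, int subCost) {
        int n = source.length();
        int m = target.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        // Turning an empty source into the first j target characters takes j insertions
        for (int j = 0; j <= m; j++) {
            prev[j] = j * insCost;
        }
        for (int i = 1; i <= n; i++) {
            // Turning the first i source characters into an empty target takes i deletions
            curr[0] = i * delCost;
            for (int j = 1; j <= m; j++) {
                boolean same = source.charAt(i - 1) == target.charAt(j - 1);
                int sub = prev[j - 1] + (same ? 0 : subCost);
                int del = prev[j] + delCost;
                int ins = curr[j - 1] + insCost;
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            // Reuse the two rows instead of allocating a full matrix
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m];
    }

    /** Version without cost parameters: every edit costs 1. */
    public static int levenshtein(String source, String target) {
        return levenshtein(source, target, 1, 1, 1);
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));         // 3
        System.out.println(levenshtein("GUMBO", "GAMBOL", 2, 1, 1));  // 3
    }
}
```

With the default costs, "kitten" becomes "sitting" through two substitutions and one insertion, hence 3. In the second call an insertion costs 2, so one substitution plus one insertion also adds up to 3.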

If you need to calculate the percentage difference, you should divide the levenshtein function result by the length of your source text (or by the target length, depending on how you define the percentage difference).
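As a rough illustration of that division, the sketch below turns a distance and a reference length into a percentage. The class and method names are hypothetical, and which length to pass in (source or target) is left to the caller, since that depends on your own definition.

```java
final class PercentDifferenceSketch {
    /**
     * Edit distance expressed as a percentage of a reference length.
     * The reference length may be the source or the target length.
     */
    public static double percentDifference(int distance, int referenceLength) {
        if (referenceLength <= 0) {
            // An empty reference string would mean dividing by zero
            throw new IllegalArgumentException("reference length must be positive");
        }
        // 100.0 forces floating-point arithmetic, so nothing is truncated
        return 100.0 * distance / referenceLength;
    }

    public static void main(String[] args) {
        // "kitten" -> "sitting" has distance 3; the lengths are 6 and 7
        System.out.println(percentDifference(3, 6)); // 50.0, relative to the source
        System.out.println(percentDifference(3, 7)); // relative to the target
    }
}
```

The same distance gives a different percentage depending on the length chosen, so it is worth fixing one definition and using it consistently.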

Use this:

SELECT *
FROM description
WHERE 100 * (length(desci) - levenshtein(desci, ?))
         / length(desci) > 60

The Levenshtein distance is the count of how many letters must change (substitute, delete or insert) for one string to become the other. Put simply, it's the number of letters that are different.

The number of letters that are the same is then length - levenshtein.

To express this as a fraction, divide by the length, i.e. (length - levenshtein) / length.

To express a fraction as a percentage, multiply by 100.

I perform the multiplication by 100 first to avoid integer division truncation problems.
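The truncation issue is easy to reproduce outside the database. The following Java sketch (hypothetical names, plain int arithmetic, which truncates the same way) compares the two orders of evaluation for a string of length 20 with a distance of 6.

```java
final class SimilarityPercentSketch {
    /** Multiply first: 100 * (length - distance) is computed before the division. */
    public static int multiplyFirst(int length, int distance) {
        return 100 * (length - distance) / length;
    }

    /** Divide first: the integer division truncates the fraction before scaling. */
    public static int divideFirst(int length, int distance) {
        return (length - distance) / length * 100;
    }

    public static void main(String[] args) {
        int length = 20;
        int distance = 6;
        System.out.println(multiplyFirst(length, distance)); // 70
        System.out.println(divideFirst(length, distance));   // 0
    }
}
```

Dividing first computes 14 / 20, which truncates to 0 in integer arithmetic, so the result is 0 regardless of the later multiplication. Multiplying first computes 1400 / 20 = 70.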
