
Using levenshtein on parts of string in SQL

I am trying to work some fuzzy searching into our storefront search field using the Levenshtein method, but I'm running into a problem with how to search for only part of a product name.

For example, a customer searches for scisors, but we have a product called electric scissor. Using the Levenshtein method, levenshtein("scisors","electric scissor") gives a result of 11, because the "electric " part is counted as a difference.

What I am looking for is a way to compare the search term against substrings of the product name: it would evaluate levenshtein("scisors","electric") and then levenshtein("scisors","scissor"), see that the second substring gives a result of only 2, and therefore show that product in the search results.

Non-working example to give you an idea of what I'm after:

SELECT * FROM products p WHERE levenshtein("scisors", p.name) < 5

Question: Is there a way to write an SQL statement that checks against parts of the string? Would I need to create more functions in my database to handle it, or modify my existing function, and if so, what would it look like?

I am currently using this implementation of the levenshtein method:

CREATE FUNCTION levenshtein(s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
  END

This is a bit long for a comment.

First, I would suggest using a full-text search with a synonyms list. That said, you might have users with really bad spelling, so the synonyms list might be difficult to maintain.

If you use Levenshtein distance, then I suggest doing it on a per-word basis. For each word in the user's input, find the closest word in the name field. Then add these distances together to score the match; the product with the lowest total is the best match.
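The "closest word" step could be sketched as a small MySQL helper built on the levenshtein function from the question. This is only a sketch under assumptions: the name levenshtein_closest_word is made up for illustration, words are assumed to be separated by single or repeated spaces, and punctuation is not handled.

```sql
DELIMITER $$

-- Hypothetical helper (not part of the original post): returns the smallest
-- Levenshtein distance between `needle` and any space-separated word
-- of `haystack`. Relies on the levenshtein() function from the question.
CREATE FUNCTION levenshtein_closest_word(needle VARCHAR(255), haystack VARCHAR(255))
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE remaining VARCHAR(255) DEFAULT TRIM(haystack);
    DECLARE word VARCHAR(255);
    DECLARE best INT DEFAULT NULL;
    DECLARE dist INT;
    WHILE CHAR_LENGTH(remaining) > 0 DO
      -- Take the text before the first space as the next word.
      SET word = SUBSTRING_INDEX(remaining, ' ', 1);
      SET dist = levenshtein(needle, word);
      IF best IS NULL OR dist < best THEN
        SET best = dist;
      END IF;
      -- Drop the consumed word and any spaces that followed it.
      SET remaining = TRIM(SUBSTRING(remaining, CHAR_LENGTH(word) + 2));
    END WHILE;
    -- A NULL or empty haystack has no words; fall back to the needle's length.
    RETURN IFNULL(best, CHAR_LENGTH(needle));
  END$$

DELIMITER ;

-- Usage: the cutoff of 2 is an arbitrary assumption, tune it for your data.
SELECT p.*
FROM products p
WHERE levenshtein_closest_word('scisors', p.name) <= 2;
```

Because the function is evaluated for every row, a query like this scans the whole products table, so it is probably only practical for a modest catalogue or as a fallback when an exact search finds nothing.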

In your example, you would have these comparisons:

  • levenshtein('scisors', 'electric')
  • levenshtein('scisors', 'scissor')

The minimum would be the second. If the user types multiple words, such as 'electrk scisors', then you would be doing:

  • levenshtein('electrk', 'electric') <-- minimum
  • levenshtein('electrk', 'scissor')
  • levenshtein('scisors', 'electric')
  • levenshtein('scisors', 'scissor') <-- minimum

This is likely to be an intuitive way to approach the search.
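For multi-word input, the whole scheme (take the minimum per search word, then sum the minimums) might look roughly like the function below. It is a self-contained sketch, not a tested production routine: levenshtein_per_word, its parameter names, and the score alias are hypothetical, it calls the levenshtein function from the question, and it assumes words are separated by spaces only.

```sql
DELIMITER $$

-- Hypothetical scoring function (not part of the original post): for each
-- space-separated word in `search_text`, find the closest word in
-- `product_name`, then return the sum of those minimum distances.
CREATE FUNCTION levenshtein_per_word(search_text VARCHAR(255), product_name VARCHAR(255))
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE search_rest VARCHAR(255) DEFAULT TRIM(search_text);
    DECLARE name_rest, search_word, name_word VARCHAR(255);
    DECLARE total INT DEFAULT 0;
    DECLARE best, dist INT;
    WHILE CHAR_LENGTH(search_rest) > 0 DO
      SET search_word = SUBSTRING_INDEX(search_rest, ' ', 1);
      SET best = NULL;
      SET name_rest = TRIM(product_name);
      -- Inner loop: closest product-name word for this search word.
      WHILE CHAR_LENGTH(name_rest) > 0 DO
        SET name_word = SUBSTRING_INDEX(name_rest, ' ', 1);
        SET dist = levenshtein(search_word, name_word);
        IF best IS NULL OR dist < best THEN
          SET best = dist;
        END IF;
        SET name_rest = TRIM(SUBSTRING(name_rest, CHAR_LENGTH(name_word) + 2));
      END WHILE;
      -- If the name had no words, charge the full length of the search word.
      SET total = total + IFNULL(best, CHAR_LENGTH(search_word));
      SET search_rest = TRIM(SUBSTRING(search_rest, CHAR_LENGTH(search_word) + 2));
    END WHILE;
    RETURN total;
  END$$

DELIMITER ;

-- Usage: rank products by total distance. The cutoff of 4 is an arbitrary
-- assumption, and `score` is assumed not to clash with an existing column.
SELECT *
FROM (
  SELECT p.*, levenshtein_per_word('electrk scisors', p.name) AS score
  FROM products p
) AS scored
WHERE score <= 4
ORDER BY score;
```

With the example above, 'electrk scisors' against 'electric scissor' should score 2 + 2 = 4: two edits turn electrk into electric, plus the distance of 2 between scisors and scissor noted in the question. A fixed cutoff gets looser as the user types more words, so scaling it by the number of search words may be worth considering.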
