
Using levenshtein on parts of string in SQL

I am trying to add fuzzy searching to our storefront search field using Levenshtein distance, but I'm running into a problem: how do I match the search term against only part of a product name?

For example, a customer searches for scisors, but we have a product called electric scissor. Calling levenshtein("scisors", "electric scissor") returns 11, because the entire "electric " prefix is counted as a difference.

What I am looking for is a way to compare the search term against each word of the product name separately. It would evaluate levenshtein("scisors", "electric") and levenshtein("scisors", "scissor"), see that the second comparison gives a distance of only 2, and therefore include that product in the search results.

Non-working example to give you an idea of what I'm after:

SELECT * FROM products p WHERE levenshtein("scisors", p.name) < 5

Question: Is there a way to write an SQL statement that checks against parts of the string? Would I need to create additional functions in my database, or modify my existing function? If so, what would that look like?

I am currently using this implementation of the levenshtein method:

CREATE FUNCTION levenshtein(s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- cv1 holds the previous row of the distance matrix and cv0 the row being built,
    -- one byte per cell (max strlen=255)
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      -- First row: distance from the empty prefix of s1 to each prefix of s2
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          -- Candidate 1: left neighbour in the current row, plus one
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          -- Candidate 2: diagonal neighbour, plus the substitution cost
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          -- Candidate 3: neighbour directly above in the previous row, plus one
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
  END

This is a bit long for a comment.

First, I would suggest using a full-text search with a synonyms list. That said, you might have users with really bad spelling abilities, so the synonyms list might be difficult to maintain.

If you use Levenshtein distance, then I suggest doing it on a per-word basis. For each word in the user's input, find the closest word in the name field, that is, the one with the smallest distance. Then add these minimum distances together to score the match; the lowest total is the best match.
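As a rough sketch of the single-word case, a helper function could split the product name on spaces and keep the smallest distance. The name min_word_levenshtein and its parameters are made up for illustration; the sketch assumes the levenshtein(s1, s2) function from the question is already installed and that words are separated by plain spaces.

```sql
-- Hypothetical helper (not part of the original post).
-- Returns the smallest levenshtein() distance between needle and any
-- space-separated word in haystack.
DELIMITER $$
CREATE FUNCTION min_word_levenshtein(needle VARCHAR(255), haystack VARCHAR(255))
  RETURNS INT
  DETERMINISTIC
BEGIN
  DECLARE remaining VARCHAR(255) DEFAULT TRIM(haystack);
  DECLARE cur_word VARCHAR(255);
  DECLARE best INT DEFAULT NULL;
  DECLARE d INT;
  WHILE CHAR_LENGTH(remaining) > 0 DO
    -- Everything up to the first space is the next word
    SET cur_word = SUBSTRING_INDEX(remaining, ' ', 1);
    SET d = levenshtein(needle, cur_word);
    IF best IS NULL OR d < best THEN
      SET best = d;
    END IF;
    -- Drop the word just compared, plus any spaces before the next one
    SET remaining = LTRIM(SUBSTRING(remaining, CHAR_LENGTH(cur_word) + 1));
  END WHILE;
  -- A name with no words falls back to the full length of the needle
  RETURN IFNULL(best, CHAR_LENGTH(needle));
END$$
DELIMITER ;

-- Example use; the threshold of 3 is an arbitrary choice for illustration
SELECT p.*, min_word_levenshtein('scisors', p.name) AS distance
FROM products p
WHERE min_word_levenshtein('scisors', p.name) < 3
ORDER BY distance;
```

Because the function has to be evaluated for every row, a query like this cannot use an index on name, so it may be slow on a large products table.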

In your example, you would have these comparisons:

  • levenshtein('scisors', 'electric')
  • levenshtein('scisors', 'scissor')

The minimum would be the second. If the user types multiple words, such as 'electrk scisors', then you would be doing

  • levenshtein('electrk', 'electric') <-- minimum
  • levenshtein('electrk', 'scissor')
  • levenshtein('scisors', 'electric')
  • levenshtein('scisors', 'scissor') <-- minimum

This is likely to be an intuitive way to approach the search.
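The multi-word scoring described above might be sketched as one stored function that loops over the words of the search text and, for each, over the words of the product name. The name sum_min_word_levenshtein and its parameters are hypothetical; the sketch assumes the levenshtein(s1, s2) function from the question exists and that words are separated by plain spaces.

```sql
-- Hypothetical helper (not part of the original post).
-- For each word in search_text, finds the closest word in product_name,
-- then returns the sum of those minimum distances.
DELIMITER $$
CREATE FUNCTION sum_min_word_levenshtein(search_text VARCHAR(255), product_name VARCHAR(255))
  RETURNS INT
  DETERMINISTIC
BEGIN
  DECLARE q_rest VARCHAR(255) DEFAULT TRIM(search_text);
  DECLARE n_rest VARCHAR(255);
  DECLARE q_word, n_word VARCHAR(255);
  DECLARE best, d INT;
  DECLARE total INT DEFAULT 0;
  WHILE CHAR_LENGTH(q_rest) > 0 DO
    SET q_word = SUBSTRING_INDEX(q_rest, ' ', 1);
    SET n_rest = TRIM(product_name), best = NULL;
    -- Inner loop: minimum distance from this search word to any name word
    WHILE CHAR_LENGTH(n_rest) > 0 DO
      SET n_word = SUBSTRING_INDEX(n_rest, ' ', 1);
      SET d = levenshtein(q_word, n_word);
      IF best IS NULL OR d < best THEN
        SET best = d;
      END IF;
      SET n_rest = LTRIM(SUBSTRING(n_rest, CHAR_LENGTH(n_word) + 1));
    END WHILE;
    -- A name with no words costs the full length of the search word
    SET total = total + IFNULL(best, CHAR_LENGTH(q_word));
    SET q_rest = LTRIM(SUBSTRING(q_rest, CHAR_LENGTH(q_word) + 1));
  END WHILE;
  RETURN total;
END$$
DELIMITER ;

-- Example use: rank products by total distance, best matches first
SELECT p.*, sum_min_word_levenshtein('electrk scisors', p.name) AS score
FROM products p
ORDER BY score
LIMIT 20;
```

For the name 'electric scissor', the two minima marked above are 2 each, so this sketch should score that product at 4. Since longer search inputs naturally accumulate larger totals, any cutoff on the score probably needs to scale with the number of words typed.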


 