
Finding most similar sentence match

I have a large dataset containing a mix of words and short phrases, such as:

dataset = [
    "car",
    "red-car",
    "lorry",
    "broken lorry",
    "truck owner",
    "train",
    ...
]

I am trying to find a way to determine which dataset entry is most similar to a short input sentence, such as:

input = "I love my car that is red"   # should map to "red-car"
input = "I purchased a new lorry"     # should map to "lorry"
input = "I hate my redcar"            # should map to "red-car"
input = "I will use my truck"         # should map to "truck owner"
input = "Look at that yellow lorri"   # should map to "lorry"

I have tried a number of approaches to this, to no avail, including:

Vectorizing the dataset and the input using TfidfVectorizer, then calculating the cosine similarity of the vectorized input against each individual vectorized item from the dataset.

The problem is that this only really works if the input contains the exact word(s) that are in the dataset. For example, if the input = "trai", the cosine similarity would be 0, whereas I want it to map to the value "train" in the dataset.

The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to choose the most similar result even when the words are slightly different, i.e.:

input = "broke"    # should map to "broken lorry" given the above dataset

If someone could suggest other potential approaches I could try, that would be much appreciated.

As @Aaalok has suggested in the comments, one idea is to use a different distance/similarity function. Possible candidates include

  • Levenshtein distance (measures the number of changes to transform one string into the other)
  • N-gram similarity (measures the number of shared n-grams between both strings)
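To make the two measures concrete, here is a minimal pure-Python sketch. The function names (`levenshtein`, `char_ngrams`, `ngram_similarity`, `best_match`) are made up for illustration, and the choices of character trigrams, space padding and Jaccard overlap are assumptions rather than the only way to define n-gram similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn a into b (classic dynamic-programming formulation)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams, padded with one space on each side
    (the padding is an assumption so word boundaries count too)."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(query: str, candidates: list, n: int = 3) -> str:
    """Return the candidate with the highest n-gram similarity to query."""
    return max(candidates, key=lambda c: ngram_similarity(query, c, n))


dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]
print(levenshtein("lorri", "lorry"))   # 1
print(best_match("trai", dataset))     # train
```

For whole sentences, comparing the full input against each item penalises long inputs, so you would probably want to score each word (or short window of words) of the input separately and keep the best score; that part is left out of the sketch.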

Another possibility is feature generation, i.e. enhancing the items in your dataset with additional strings. These could be n-grams, stems, or whatever suits your needs. For example, you could (automatically) expand red-car into

red-car red car
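A rough sketch of that kind of expansion is below. The helpers `expand` and `build_index` are hypothetical names, and adding the joined form ("redcar") on top of the split tokens is my own assumption, motivated by the "I hate my redcar" example in the question.

```python
import re


def expand(item: str) -> set:
    """Generate extra lookup strings for one dataset item: the item itself,
    its tokens, the tokens joined by spaces, and the tokens glued together."""
    lowered = item.lower()
    tokens = [t for t in re.split(r"[\s\-]+", lowered) if t]
    variants = {lowered, " ".join(tokens), "".join(tokens)}
    variants.update(tokens)
    return variants


def build_index(dataset: list) -> dict:
    """Map every generated variant back to the dataset items it came from."""
    index = {}
    for item in dataset:
        for variant in expand(item):
            index.setdefault(variant, set()).add(item)
    return index


index = build_index(["car", "red-car", "lorry"])
print(sorted(expand("red-car")))  # ['car', 'red', 'red car', 'red-car', 'redcar']
print(index["redcar"])            # {'red-car'}
```

Note that a variant such as "car" now points at both "car" and "red-car", so you still need some scoring step (for example one of the similarity functions mentioned above) to pick between the candidates.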

Paragraph vectors (doc2vec) should solve your problem, provided you have a large enough and suitable dataset. Of course, you will have to do a lot of tuning to get good results. You could try gensim or deeplearning4j. You may still need some other method to handle spelling mistakes.
