I have a large dataset containing a mix of words and short phrases, such as:
dataset = [
"car",
"red-car",
"lorry",
"broken lorry",
"truck owner",
"train",
...
]
I am trying to find a way to determine the most similar word from a short sentence, such as:
input = "I love my car that is red" # should map to "red-car"
input = "I purchased a new lorry" # should map to "lorry"
input = "I hate my redcar" # should map to "red-car"
input = "I will use my truck" # should map to "truck owner"
input = "Look at that yellow lorri" # should map to "lorry"
I have tried a number of approaches to this, to no avail, including:
Vectorizing the dataset and the input using TfidfVectorizer, then calculating the cosine similarity of the vectorized input against each individual vectorized item from the dataset.
The problem is that this only really works if the input contains the exact word(s) that are in the dataset. For example, when input = "trai", it has a cosine similarity of 0, whereas I am trying to get it to map to the value "train" in the dataset.
The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to choose the most similar result even when the words are slightly different, i.e.:
input = "broke" # should map to "broken lorry" given the above dataset
If someone could suggest other potential approaches I could try, that would be much appreciated.
As @Aaalok has suggested in the comments, one idea is to use a different distance/similarity function. Possible candidates include
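One candidate of this kind, picked here purely as an illustration, is a character-level similarity ratio such as the one provided by `difflib.SequenceMatcher` in the Python standard library. The sketch below compares every one-word and two-word window of the input against each dataset item after stripping punctuation and spaces. The helper names `normalize` and `best_match` are hypothetical, and the windowing scheme is an assumption rather than an established recipe.

```python
from difflib import SequenceMatcher

DATASET = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]

def normalize(text):
    """Lowercase and keep only letters/digits, so 'red-car' and 'redcar' compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def best_match(sentence, dataset):
    """Return (item, score) for the item closest to any 1- or 2-word window of the sentence."""
    words = sentence.lower().split()
    # Candidate windows: single words plus adjacent word pairs.
    windows = words + [" ".join(pair) for pair in zip(words, words[1:])]
    best_item, best_score = None, 0.0
    for item in dataset:
        target = normalize(item)
        for window in windows:
            score = SequenceMatcher(None, normalize(window), target).ratio()
            if score > best_score:  # strict comparison: the first item wins ties
                best_item, best_score = item, score
    return best_item, best_score
```

This tolerates typos such as "lorri" or "trai", but it does not handle reordered words: "I love my car that is red" will still prefer "car" over "red-car", because "car" appears verbatim in the input.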
Another possibility is feature generation, i.e. enhancing the items in your dataset with additional strings. These could be n-grams, stems, or whatever suits your needs. For example, you could (automatically) expand red-car into red-car red car.
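A minimal sketch of this idea, using only the standard library: each dataset item is expanded into a few spelling variants, every variant is broken into character trigrams, and the input is matched to the item whose trigram set has the highest Jaccard overlap with the input's trigrams. The function names (`variants`, `char_ngrams`, `build_index`, `most_similar`), the choice of trigrams, and the Jaccard score are all assumptions made for illustration; other n-gram sizes or weightings may suit your data better.

```python
import re

def variants(item):
    """Expand e.g. 'red-car' into {'red-car', 'red car', 'redcar', 'red', 'car'}."""
    item = item.lower()
    parts = [p for p in re.split(r"[\s\-]+", item) if p]
    return {item, " ".join(parts), "".join(parts), *parts}

def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(dataset, n=3):
    """Map each dataset item to the union of n-grams over all of its variants."""
    index = {}
    for item in dataset:
        grams = set()
        for variant in variants(item):
            grams |= char_ngrams(variant, n)
        index[item] = grams
    return index

def most_similar(sentence, index, n=3):
    """Return the item whose n-gram set has the highest Jaccard overlap with the sentence."""
    query = char_ngrams(sentence, n)

    def jaccard(item):
        grams = index[item]
        return len(query & grams) / len(query | grams)

    return max(index, key=jaccard)

index = build_index(["car", "red-car", "lorry", "broken lorry", "truck owner", "train"])
print(most_similar("I love my car that is red", index))
```

Because the comparison works on character fragments rather than whole words, partial inputs such as "trai" or "broke" still share trigrams with "train" and "broken lorry", and word order in the input does not matter.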
Paragraph vectors (doc2vec) should solve your problem, provided you have a large enough and suitable dataset. Of course, you'll have to do a lot of tuning to get your results right. You could try gensim or deeplearning4j. But you may have to use some other method to handle spelling mistakes.