
find cosine similarity between words

Is it possible to find similarity between two words? For example:

cos_lib = cosine_similarity('kamra', 'cameras')

This gives me an error

ValueError: could not convert string to float: 'kamra'

because I haven't converted the words into numerical vectors. How can I do so? I tried this but it wouldn't work either:

('kamra').toarray()

My aim is to compare a word against both value lists of my dictionary and return the key with the highest similarity. Is that possible?

features = {"CAMERA": ['camera', 'kamras'], "BATTERY": ['batterie', 'battery']}

I also tried this but I am not satisfied with the results:

print(damerau.distance('dual camera', 'cameras'))
print(damerau.distance('kamra', 'battery'))

since the results are 6 and 5. But the first two strings are more similar, so their distance should be smaller. That's what I am trying to achieve.

Cosine distance is always defined between two real-valued vectors of the same length.
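For concreteness, here is that definition as a minimal NumPy sketch (the function name and toy vectors are just for illustration):

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # ~1.0, since the vectors are parallel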

As for words/sentences/strings, there are two kinds of distances:

Minimum Edit Distance (MED): This is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other. The strings need not have any meaning in the language for MED to be defined. For example, the strings abcd and abed have MED = 1, even though neither is a real word.
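As a minimal sketch, MED (Levenshtein distance) can be computed with the classic dynamic-programming recurrence:

def med(a, b):
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

print(med("abcd", "abed"))  # 1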

Semantic distance: This is a measure of how far apart words are in terms of meaning. For that, you need a vocabulary, on top of which a model is built. Words are converted into numerical vectors representing their relative meaning; for example, the vectors for tree and wood would be closer together than the vectors for king and queen. Vector representations of words can be obtained from common models like Word2Vec, or from large neural networks like BERT or GPT-2. Cosine distance between vector representations is one type of semantic distance; Euclidean distance is another.

Note: With semantic representations, all words that do not match any word in the vocabulary (e.g. kamra, abcxyz) are grouped under a single meaning representing the {unknown word} token.
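To see how the two semantic distances differ, note that Euclidean distance is sensitive to vector magnitude while cosine distance is not (toy vectors for illustration):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
print(np.linalg.norm(a - b))   # ~3.74: Euclidean distance is nonzero
# cosine similarity is ~1.0 here, so cosine distance is ~0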

For your particular use case, I would suggest running MED to get the most probable word from the vocabulary, followed by some form of semantic distance. You can try some autocorrection APIs for the former.
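A rough sketch of that two-step idea, using difflib's ratio-based matcher from the standard library as a stand-in for a proper MED/autocorrection step (the vocabulary below is assumed purely for illustration):

import difflib

vocabulary = ["camera", "cameras", "battery", "batteries"]  # assumed vocabulary

# step 1: map the raw token to its closest in-vocabulary word
matches = difflib.get_close_matches("kamra", vocabulary, n=1, cutoff=0.0)
corrected = matches[0] if matches else "kamra"
print(corrected)  # 'camera' on this toy vocabulary

# step 2: score the corrected word with a semantic model, e.g. (with the
# model_glove loaded below): model_glove.similarity(corrected, "cameras")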

I'd recommend using a pre-trained model from Gensim. You can download a pre-trained model and then compute the cosine similarity between the two words' vectors.

import gensim.downloader as api
# overview of all models in gensim: https://github.com/RaRe-Technologies/gensim-data
model_glove = api.load("glove-wiki-gigaword-100")

model_glove.relative_cosine_similarity("politics", "vote")
# output: 0.07345439049627836
model_glove.relative_cosine_similarity("film", "camera")
# output: 0.06281138757741007
model_glove.relative_cosine_similarity("economy", "fart")
# output: -0.01170896437873441

Pre-trained models will have a hard time recognising typos, though, because those were probably not in the training data. Figuring them out is a separate task from cosine similarity.

model_glove.relative_cosine_similarity("kamra", "cameras")
# output: -0.040658474068872255
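If you want to detect such out-of-vocabulary or misspelled tokens up front, you can check vocabulary membership first (a sketch; gensim 4.x exposes the vocabulary as key_to_index, while older versions used model.vocab instead):

# check whether each token is in the pre-trained vocabulary before scoring it
for word in ("kamra", "cameras"):
    print(word, word in model_glove.key_to_index)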

The following function might be useful, though, if you have several words and want the one from the list that is most similar to a given word:

model_glove.most_similar_to_given("camera", ["kamra", "movie", "politics", "umbrella", "beach"])
# output: 'movie'
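Building on that, here is a sketch of what the question asked for: return the dictionary key whose value list is, on average, most similar to the input word. best_feature and mean_sim are hypothetical helpers, words missing from the model's vocabulary are skipped, and the input word itself must be in the vocabulary or model.similarity will raise a KeyError:

def best_feature(word, features, model):
    # return the feature key whose synonym list has the highest
    # mean cosine similarity to `word`
    def mean_sim(synonyms):
        sims = [model.similarity(word, s) for s in synonyms
                if s in model.key_to_index]
        return sum(sims) / len(sims) if sims else float("-inf")
    return max(features, key=lambda k: mean_sim(features[k]))

features = {"CAMERA": ["camera", "kamras"], "BATTERY": ["batterie", "battery"]}
# e.g. best_feature("cameras", features, model_glove) should favour "CAMERA",
# provided "cameras" itself is in the model's vocabulary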

Luckily, there are libraries that do exactly that, such as word2vec. You would need to train it on some corpus of data, or download a pre-trained model (for your specific language or set of languages).
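For instance, training a small Word2Vec model with gensim looks roughly like this (toy corpus and hyperparameters are purely illustrative; gensim 4.x API):

from gensim.models import Word2Vec

# toy corpus; in practice you would train on a large, tokenised text collection
sentences = [
    ["the", "camera", "takes", "sharp", "pictures"],
    ["the", "battery", "lasts", "all", "day"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
print(model.wv.similarity("camera", "battery"))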
