
How to improve Word Mover's Distance similarity in Python and compute a similarity score using weighted sentences

Word Mover's Distance (WMD) can be used to measure the similarity between texts, and that similarity can be used to find the texts nearest to a given query. However, I was unable to customise the algorithm to do the following:
1) Prevent locations (GPE entities, as identified by spaCy) in the text from carrying any weight when comparing similarity.
2) Give more weight to features in the first sentence of a text than to features in the second sentence, more to the second than to the third, and so on.

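from time import time
from gensim.similarities import WmdSimilarity

# wmd_corpus, loaded_model, documents and preprocess() are defined earlier (not shown).
num_best = 10
instance = WmdSimilarity(wmd_corpus, loaded_model, num_best=num_best)
start = time()
sent = 'Abc hotel serves best in class drunken prawn in north america . ABC Hotel has branches in London, New York, Chicago and San Francisco.'
query = preprocess(sent)

sims = instance[query]  # A query is simply a "look-up" in the similarity class.

print('Cell took %.2f seconds to run.' % (time() - start))

print('Query:')
print(sent)
# sims holds (document index, similarity) pairs; it may contain fewer than num_best entries.
for doc_index, score in sims:
    print()
    print('sim = %.4f' % score)
    print(documents[doc_index])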

In this particular example, where a hotel description is passed in for WMD similarity, the results include descriptions such as:

- DEF is a restaurant in Chicago serving vegan food since 1969.
- JKL now serving in London, New York, Chicago and San Francisco
- Bestsellers of the hotel include drunken prawn, lasagne etc. (MNO Hotel)

Expected result: only MNO Hotel from the results above is relevant according to the food aspect.

Question: how can I eliminate the other hotels, which are matched because of their location?

The question is quite old, but I will try to answer it, because I am experiencing similar issues and I think that WMD is still one of the state-of-the-art (SOTA) similarity metrics for text.

Are you sure you are using spaCy? It looks like you are using the WMD implementation from gensim. In any case, to answer your 2nd question: you can parse your company descriptions and split them into sentences with any NLP library. Then you can create embeddings for each sentence, instead of for each document (the whole description) as in your current implementation. Gensim is using Word2vec in the above link, and you can also use that to get embeddings. You can then compare two companies sentence by sentence, giving a higher weight when the vectors of their first sentences match, a slightly lower weight when their second sentences match, and so on. I am not aware of any existing library that does this, but if you implement it on top of spaCy, you will have the freedom to tune the weights and the logic as you wish.
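As a rough illustration of the position weighting described above, here is a minimal sketch. The function names, the `decay` parameter and the token-overlap scorer are all hypothetical and not part of gensim or spaCy; in practice you would pass in a WMD-based sentence similarity (for example something like 1 / (1 + distance)) instead of the toy `jaccard` scorer.

```python
from typing import Callable, Sequence


def weighted_sentence_similarity(
    sents_a: Sequence[str],
    sents_b: Sequence[str],
    sentence_sim: Callable[[str, str], float],
    decay: float = 0.5,
) -> float:
    """Position-weighted average of per-sentence similarities.

    Sentence i of one text is compared with sentence i of the other,
    and that pair gets weight decay**i, so earlier sentences count more.
    Extra sentences in the longer text are ignored (an assumption).
    """
    n = min(len(sents_a), len(sents_b))
    if n == 0:
        return 0.0
    weights = [decay ** i for i in range(n)]
    scores = [sentence_sim(a, b) for a, b in zip(sents_a, sents_b)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a WMD-based sentence similarity (token overlap)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


query = ["best drunken prawn", "branches in london"]
food_match = ["best drunken prawn", "open since 1969"]
location_match = ["vegan food", "branches in london"]

print(weighted_sentence_similarity(query, food_match, jaccard))
print(weighted_sentence_similarity(query, location_match, jaccard))
```

With `decay=0.5`, a match in the first sentence counts twice as much as a match in the second, so the description that shares the food sentence scores higher than the one that only shares the location sentence. The decay value is something you would tune for your own data.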

Regarding your 1st question: I think WMD should take the locations into account, because ABC is similar not only to MNO regarding the food, but also to JKL and DEF regarding their common locations. Deleting the locations from the text will bias the similarity results.

If you want to do this anyway, you can tokenize the text, remove the locations and then build the text string again. For instance:

# This can be part of your preprocessing function.
# You can apply it to all companies in advance.
import spacy
import wmd  # from the wmd-relax package

spacy_nlp = spacy.load('en_core_web_lg')
text = "Some hotel description"
doc = spacy_nlp(text)
# "GPE" is spaCy's entity label for countries, cities and states.
type_to_remove = "GPE"
# Keep every token except those tagged with the entity type to drop.
current_tokens = [token.text for token in doc if token.ent_type_ != type_to_remove]
new_text = " ".join(current_tokens)
...
...

# Register the WMD similarity hook before creating the docs to compare.
# This add_pipe call style is for spaCy 2.x; newer versions register components by name.
spacy_nlp.add_pipe(wmd.WMD.SpacySimilarityHook(spacy_nlp), last=True)
doc_1 = spacy_nlp(new_text)
doc_2 = spacy_nlp("Another hotel description")
print(doc_1.similarity(doc_2))

This is the implementation of WMD integrated into spacy.

Since you are looking for the nearest similar texts, you can check out this example code. There is a nearest_neighbors function you can call to get the texts most similar to a query text based on WMD. You can also define your own embeddings, as you can see in the class SpacyEmbeddings, in case you create new embeddings that favor sentences specifically for your use case.
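To make the nearest-neighbour idea concrete, here is a small generic sketch. It is not the library's nearest_neighbors function: `rank_by_distance` and `token_overlap_distance` are hypothetical names, and the toy distance only stands in for a real WMD distance between two texts.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def rank_by_distance(
    query: str,
    corpus: Sequence[str],
    distance: Callable[[str, str], float],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return up to k (corpus index, distance) pairs, smallest distance first."""
    scored = ((i, distance(query, doc)) for i, doc in enumerate(corpus))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


def token_overlap_distance(a: str, b: str) -> float:
    """Toy stand-in for WMD: 1 minus the Jaccard overlap of the token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta | tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


corpus = [
    "drunken prawn and lasagne",
    "vegan food in chicago",
    "branches in london",
]
for index, dist in rank_by_distance("best drunken prawn", corpus, token_overlap_distance, k=2):
    print(index, round(dist, 3), corpus[index])
```

Swapping `token_overlap_distance` for a function that computes WMD on preprocessed (for example, location-stripped) text gives the same ranking loop over your real corpus.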


 