
Finding the similarity between two sentences using averaged word2vec vectors in Python

I want to calculate the similarity between two sentences using word2vec. I am trying to get the word vectors of each sentence and average them into a single sentence vector, so that I can compute the cosine similarity between the two sentences. I have tried the code below, but it is not working: the sentence vectors it outputs contain only ones. I want the actual sentence vectors in sentence_1_avg_vector and sentence_2_avg_vector.

Code:

    import gensim
    import numpy as np
    from scipy import spatial

    #DataSet#
    sent1=[['What', 'step', 'step', 'guide', 'invest', 'share', 'market', 'india'],['What', 'story', 'Kohinoor', 'KohiNoor', 'Diamond']]
    sent2=[['What', 'step', 'step', 'guide', 'invest', 'share', 'market'],['What', 'would', 'happen', 'Indian', 'government', 'stole', 'Kohinoor', 'KohiNoor', 'diamond', 'back']]
    sentences=sent1+sent2

    #''''Applying Word2vec''''#
    word2vec_model=gensim.models.Word2Vec(sentences, size=100, min_count=5)
    bin_file="vecmodel.csv"
    word2vec_model.wv.save_word2vec_format(bin_file,binary=False)

    #''''Making Sentence Vectors''''#
    def avg_feature_vector(words, model, num_features, index2word_set):
        #function to average all words vectors in a given paragraph
        featureVec = np.ones((num_features,), dtype="float32")
        #print(featureVec)
        nwords = 0
        #list containing names of words in the vocabulary
        index2word_set = set(model.wv.index2word)# this is moved as input param for performance reasons
        for word in words:
            if word in index2word_set:
                nwords = nwords+1
                featureVec = np.add(featureVec, model[word])
                print(featureVec)
        if(nwords>0):
            featureVec = np.divide(featureVec, nwords)
        return featureVec

    i=0
    while i<len(sent1):
        sentence_1_avg_vector = avg_feature_vector(mylist1, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word))
        print(sentence_1_avg_vector)

        sentence_2_avg_vector = avg_feature_vector(mylist2, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word))
        print(sentence_2_avg_vector)

        sen1_sen2_similarity =  1 - spatial.distance.cosine(sentence_1_avg_vector,sentence_2_avg_vector)
        print(sen1_sen2_similarity)

        i+=1

The output this code gives:

    [ 1.  1.  ....  1.  1.]
    [ 1.  1.  ....  1.  1.]
    0.999999898245
    [ 1.  1.  ....  1.  1.]
    [ 1.  1.  ....  1.  1.]
    0.999999898245

I think what you are trying to achieve is the following:

  1. Obtain vector representations from word2vec for every word in your sentence.
  2. Average all word vectors of a sentence to obtain a sentence representation.
  3. Compute cosine similarity between the vectors of two sentences.
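Steps 2 and 3 can be sketched on their own, independent of how the word vectors are obtained. The snippet below is only an illustration: `toy_vectors`, `average_vector` and `cosine_similarity` are hypothetical names, and the 3-dimensional vectors are made up rather than taken from any trained model.

```python
import numpy as np

# Toy 3-dimensional "embeddings", invented purely for illustration;
# a real word2vec model would supply these vectors.
toy_vectors = {
    "invest": np.array([1.0, 0.0, 0.0]),
    "share": np.array([0.0, 1.0, 0.0]),
    "market": np.array([0.0, 0.0, 1.0]),
}

def average_vector(words, vectors, dim):
    """Step 2: average the vectors of all words found in `vectors`."""
    total = np.zeros(dim)  # the accumulator starts at zero
    found = 0
    for word in words:
        if word in vectors:  # skip out-of-vocabulary words
            total = total + vectors[word]
            found += 1
    return total / found if found else total

def cosine_similarity(a, b):
    """Step 3: cosine of the angle between a and b (0.0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm else 0.0

v1 = average_vector(["invest", "share", "market"], toy_vectors, dim=3)
v2 = average_vector(["invest", "share", "unknown"], toy_vectors, dim=3)
print(cosine_similarity(v1, v2))  # roughly 0.816 for these toy vectors
```

Returning 0.0 for an all-zero vector is a design choice of this sketch; it avoids a division by zero when a sentence has no in-vocabulary words.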

While the code for steps 2 and 3 looks fine to me in general (I haven't tested it, though), the issue is probably in step 1. What you are doing in your code with

    word2vec_model=gensim.models.Word2Vec(sentences, size=100, min_count=5)

is creating a new word2vec model. Because you pass sentences to the constructor, gensim builds the vocabulary and trains the model on those sentences right away, so a separate word2vec_model.train() call is not needed, and you can use the resulting vectors for each word afterwards. But to obtain useful word vectors that capture things like similarity, you usually need to train the word2vec model on a lot of data; the model provided by Google was trained on about 100 billion words.
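One more thing worth checking in that call is min_count=5, which gensim documents as ignoring all words whose total frequency is lower than the given value. The sketch below does not use gensim; it only imitates that frequency filter with the standard library on the sample sentences from the question, to see which words would survive.

```python
from collections import Counter

# Sample sentences copied from the question.
sent1 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market', 'india'],
         ['What', 'story', 'Kohinoor', 'KohiNoor', 'Diamond']]
sent2 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market'],
         ['What', 'would', 'happen', 'Indian', 'government', 'stole',
          'Kohinoor', 'KohiNoor', 'diamond', 'back']]
sentences = sent1 + sent2

min_count = 5  # same value as in the Word2Vec call

# Count how often each word occurs across all sentences.
counts = Counter(word for sentence in sentences for word in sentence)

# Imitate the frequency filter: keep words seen at least min_count times.
kept = sorted(word for word, n in counts.items() if n >= min_count)

print(max(counts.values()))  # highest frequency of any word
print(kept)                  # words that would pass the filter
```

On this sample the most frequent words occur four times, so nothing passes the filter. If the real dataset is similarly small, no word would be found in index2word_set, which would be consistent with the function returning its untouched initial vector.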

Rather than training your own model, you probably want to load a pretrained word2vec model and use it with gensim in your code. According to the gensim documentation, this can be done with the KeyedVectors.load_word2vec_format method.
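A minimal sketch of loading pretrained vectors might look like the following. The helper names `load_pretrained_vectors` and `in_vocabulary` are hypothetical, and the file name in the usage comment is a placeholder for wherever a downloaded model is stored; only `KeyedVectors.load_word2vec_format` comes from gensim itself.

```python
def load_pretrained_vectors(path, binary=True):
    """Load vectors saved in the word2vec format from `path`."""
    # Imported inside the function so gensim is only required when loading.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=binary)

def in_vocabulary(words, vectors):
    """Return the words that the loaded vectors actually contain."""
    return [word for word in words if word in vectors]

# Usage (placeholder file name, not a path from the original post):
# vectors = load_pretrained_vectors("GoogleNews-vectors-negative300.bin")
# print(vectors.vector_size)  # dimensionality of every word vector
# print(in_vocabulary(["invest", "share", "market"], vectors))
# print(vectors["market"][:5])  # first few components of one word vector
```

The loaded object can be indexed by word and tested with `in`, so it can stand in for the model in the averaging function. The length of the averaging accumulator should match `vectors.vector_size`; note that the question trains with size=100 but passes num_features=300.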

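The second part of your code (making the sentence vectors) has a bug. You have to replace:

    featureVec = np.ones((num_features,), dtype="float32")

with

    featureVec = np.zeros((num_features,), dtype="float32")

The accumulator has to start at zero so that it holds only the sum of the word vectors; starting from ones shifts every component of the average by 1/nwords. And if none of the words are found in the vocabulary (index2word_set), the function should return all zeros rather than all ones. That solved my issue. 😌 🌟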
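The effect of the starting value can be seen with a small numeric example. The two 2-dimensional vectors below are made up for illustration, and `average` is a hypothetical helper that mirrors the accumulate-then-divide logic of avg_feature_vector.

```python
import numpy as np

# Two made-up 2-dimensional word vectors; their true mean is [0.4, -0.2].
word_vectors = [np.array([0.2, -0.4]), np.array([0.6, 0.0])]

def average(start, vectors):
    """Add every vector to `start`, then divide by the number of vectors."""
    total = start.copy()
    for v in vectors:
        total = total + v
    return total / len(vectors) if vectors else total

from_zeros = average(np.zeros(2), word_vectors)
from_ones = average(np.ones(2), word_vectors)
no_words = average(np.ones(2), [])

print(from_zeros)  # the true mean of the two vectors
print(from_ones)   # the mean shifted by 1/2 in every component
print(no_words)    # untouched initial vector when no word is found
```

With no words found, the ones-initialized version returns a vector of ones, which matches the output shown in the question.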
