
How to get maximum similarity value between lists with numpy?

I have two lists. The idea is to compare each element of one list with every element of the second list in order to extract the most similar element, like a search engine.

Variables used in NLU:

import numpy as np
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

match = ['sentence_to_1',
 'sentence_to_2',
 'sentence_to_3',
  ...]

match2 = ['sentence_from_1',
 'sentence_from_2',
 'sentence_from_3',
 'sentence_from_1',
  ...]

pipe = nlu.load('xx.embed_sentence.bert_use_cmlm_multi_base_br')
df = pd.DataFrame({'one': match, 'two': match2})
predictions_1 = pipe.predict(df.one, output_level='document')
predictions_2 = pipe.predict(df.two, output_level='document')
e_col = 'sentence_embedding_bert_use_cmlm_multi_base_br'

predictions_1
output: 

  document          sentence_embedding_bert_use_cmlm_multi_base_br
0 sentence_to_1     [0.018291207030415535, -0.05946089327335358, -...
1 sentence_to_2     [0.04855785518884659, 0.09505678713321686, 0.3...
2 sentence_to_3     [0.15838183462619781, -0.19057893753051758, -0...

I have already iterated each element of one list over all elements of the other list as shown below. I would also really appreciate a cheaper approach that avoids the explicit loop, for example a list comprehension or a vectorized operation.

embed_mat = np.array([x for x in predictions_1[e_col]])  # shape: (len(match), dim)
for i in match2:
    embedding = pipe.predict(i).iloc[0][e_col]
    # a single (1, dim) query row is enough; no need to tile it len(df) times
    sim_mat = cosine_similarity(np.array(embedding).reshape(1, -1), embed_mat)
    print(sim_mat[0])
output:

[0.66812827 0.60055647 0.7160895  0.730334   0.76885804 0.54169453
 0.61199156 0.6578508  0.68869315 0.71536224 0.64135093 0.68568607
 0.7026179  0.64319338 0.60390899 0.64774842 0.62665297 0.61611091
 0.62738365 0.60333599 0.61464704 0.68141089 0.75263237 0.77213446
 0.75132462]
[0.72350056 0.65223669 0.67931278 0.62036637 0.67934842 0.62129368
 0.69825526 0.55635858 0.62417926 0.57909757 0.58463102 0.75053411
 0.62435311 0.66574652 0.6980762  0.72050293 0.64668413 0.62632569
 0.63648157 0.59476883 0.66401519 0.68794243 0.64723412 0.68215344
 0.66456176]
[0.84471557 0.75666135 0.75268174 0.71671225 0.74120815 0.78075131
 0.75810087 0.67278428 0.72912575 0.70120557 0.70225784 0.78829443
 0.70072031 0.76282867 0.78521151 0.76517436 0.7233746  0.71423372
 0.69281594 0.71363751 0.73811129 0.7231086  0.73386457 0.76077197
 0.75507266]
...

Each element of this array is a level of similarity between one of the sentences and all other sentences in the second list.
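The tiling in the loop above is also avoidable with plain numpy: one query vector against the whole embedding matrix is a single broadcasted expression. A minimal sketch with random stand-in data (the shapes mirror the 25 "to"-sentences, but the arrays here are toy values, not the real embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)
embed_mat = rng.normal(size=(25, 16))   # stand-in for the 25 "to"-sentence embeddings
embedding = rng.normal(size=16)         # stand-in for one "from"-sentence embedding

# Cosine similarity of one query against every row, no tiling needed
sims = embed_mat @ embedding / (
    np.linalg.norm(embed_mat, axis=1) * np.linalg.norm(embedding))

print(sims.shape)      # (25,)
print(sims.argmax())   # index of the most similar "to"-sentence
```

This produces the same row of 25 similarities as one iteration of the loop, without building the repeated-query matrix `m`.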

The idea is that I have a final frame like this, where for each element that I search from a list, I find the element with the highest similarity in the second list.

  element_from       element_to       similarity
0 sentence_from_1    sentence_to_5    0.95424...
1 sentence_from_3    sentence_to_10   0.93333...
2 sentence_from_11   sentence_to_12   0.55112...

Alternative solution that gives something similar:

# Manual cosine similarity between two vectors.
# Renamed from cosine_similarity to avoid shadowing the sklearn import above.
def cosine_sim(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

for i in range(embed_mat.shape[0]):
    for j in range(i + 1, embed_mat.shape[0]):
        # embed_mat is already a dense ndarray, so no .toarray() is needed
        print("The cosine similarity between the documents", i, "and", j, "is:",
              cosine_sim(embed_mat[i], embed_mat[j]))
Output:

The cosine similarity between the documents 0 and 1 is: 0.95424
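As a sanity check on the formula above (a toy example of my own, not part of the original post): vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.

```python
import numpy as np

# Manual cosine similarity, same formula as in the post
def cosine_sim(v1, v2):
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(round(cosine_sim([1, 2, 3], [2, 4, 6]), 6))  # 1.0 (same direction)
print(round(cosine_sim([1, 0], [0, 1]), 6))        # 0.0 (orthogonal)
```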

I even managed to get the desired result this way:

embed_mat = np.array([x for x in predictions_1[e_col]])
to, fro, sim = [], [], []
for i in match2:
    fro.append(i)
    embedding = pipe.predict(i).iloc[0][e_col]
    # one query row against the whole matrix is enough (sklearn's cosine_similarity)
    sim_mat = cosine_similarity(np.array(embedding).reshape(1, -1), embed_mat)
    sim.append(sim_mat[0].max())
    to.append(predictions_1['document'].values[sim_mat[0].argmax()])

pd.DataFrame({'From': fro, 'To': to, 'Similarity': sim})

But I suspect there are better ways to solve it, and by better I mean more optimized.
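One observation: `predictions_2` is already computed but never used, so both lists can be embedded once and all pairwise similarities obtained in a single matrix product, with no per-sentence `pipe.predict` calls inside the loop. A numpy-only sketch with random stand-in data (`emb_to`/`emb_from` stand in for `np.array([x for x in predictions_1[e_col]])` and the same for `predictions_2`; the sentence names are placeholders):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
emb_to = rng.normal(size=(5, 16))    # stand-in for the "to"-sentence embeddings
emb_from = rng.normal(size=(3, 16))  # stand-in for the "from"-sentence embeddings
docs_to = [f'sentence_to_{j + 1}' for j in range(len(emb_to))]
docs_from = [f'sentence_from_{i + 1}' for i in range(len(emb_from))]

def normalize(m):
    # Scale each row to unit length so a plain dot product is a cosine similarity
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# All pairwise similarities at once: shape (len(match2), len(match))
sim_mat = normalize(emb_from) @ normalize(emb_to).T
best = sim_mat.argmax(axis=1)  # best "to"-sentence per "from"-sentence

result = pd.DataFrame({
    'element_from': docs_from,
    'element_to': [docs_to[j] for j in best],
    'similarity': sim_mat[np.arange(len(best)), best],
})
print(result)
```

This replaces the whole loop with one matrix product plus an `argmax` per row, and yields the desired `element_from` / `element_to` / `similarity` frame directly.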
