I have two lists of sentences. The idea is to compare each element of one list with all the elements of the second, in order to extract the element with the greatest similarity, much like a search engine.
Variables used in NLU:
import numpy as np
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
match = ['sentence_to_1',
'sentence_to_2',
'sentence_to_3',
...]
match2 = ['sentence_from_1',
'sentence_from_2',
'sentence_from_3',
'sentence_from_1',
...]
pipe = nlu.load('xx.embed_sentence.bert_use_cmlm_multi_base_br')
df = pd.DataFrame({'one': match, 'two': match2})
predictions_1 = pipe.predict(df.one, output_level='document')
predictions_2 = pipe.predict(df.two, output_level='document')
e_col = 'sentence_embedding_bert_use_cmlm_multi_base_br'
predictions_1
Output:
document sentence_embedding_bert_use_cmlm_multi_base_br
0 sentence_to_1 [0.018291207030415535, -0.05946089327335358, -...
1 sentence_to_2 [0.04855785518884659, 0.09505678713321686, 0.3...
2 sentence_to_3 [0.15838183462619781, -0.19057893753051758, -0...
I've already iterated over each element of one list and compared it against all the elements of the other list this way. I would also really appreciate an approach that doesn't cost so much, avoiding the explicit loop, with a list comprehension for example (see the vectorized sketch after the output below).
# Matrix with the embeddings of all sentences in the first list
embed_mat = np.array([x for x in predictions_1[e_col]])

for i in match2:
    # Embed the current sentence from the second list
    embedding = pipe.predict(i).iloc[0][e_col]
    # Repeat it so it has as many rows as embed_mat
    m = np.array([embedding,]*len(df))
    sim_mat = cosine_similarity(m, embed_mat)
    print(sim_mat[0])
Output:
[0.66812827 0.60055647 0.7160895 0.730334 0.76885804 0.54169453
0.61199156 0.6578508 0.68869315 0.71536224 0.64135093 0.68568607
0.7026179 0.64319338 0.60390899 0.64774842 0.62665297 0.61611091
0.62738365 0.60333599 0.61464704 0.68141089 0.75263237 0.77213446
0.75132462]
[0.72350056 0.65223669 0.67931278 0.62036637 0.67934842 0.62129368
0.69825526 0.55635858 0.62417926 0.57909757 0.58463102 0.75053411
0.62435311 0.66574652 0.6980762 0.72050293 0.64668413 0.62632569
0.63648157 0.59476883 0.66401519 0.68794243 0.64723412 0.68215344
0.66456176]
[0.84471557 0.75666135 0.75268174 0.71671225 0.74120815 0.78075131
0.75810087 0.67278428 0.72912575 0.70120557 0.70225784 0.78829443
0.70072031 0.76282867 0.78521151 0.76517436 0.7233746 0.71423372
0.69281594 0.71363751 0.73811129 0.7231086 0.73386457 0.76077197
0.75507266]
...
Each printed row is the similarity between one sentence from match2 and every sentence in the first list (match).
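Since predictions_2 was computed above but never used afterwards, the whole similarity matrix can be obtained with a single cosine_similarity call instead of re-predicting every sentence inside the loop. A minimal sketch, assuming predictions_2 exposes the same e_col embedding column as predictions_1:

# Stack the precomputed embeddings of both lists into 2-D matrices
embed_mat_to = np.array([x for x in predictions_1[e_col]])    # shape: (len(match), dim)
embed_mat_from = np.array([x for x in predictions_2[e_col]])  # shape: (len(match2), dim)

# One call computes every pairwise similarity:
# row i = similarities of match2[i] against all sentences in match
sim_mat = cosine_similarity(embed_mat_from, embed_mat_to)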
The goal is a final frame like this, where for each element that I search from one list, I find the element with the highest similarity in the second list.
element_from element_to similarity
0 sentence_from_1 sentence_to_5 0.95424...
1 sentence_from_3 sentence_to_10 0.93333...
2 sentence_from_11 sentence_to_12 0.55112...
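Building on the sim_mat sketch above (an assumption about the data, not the exact output from the question), that frame can be filled with an argmax along each row:

# Index of the best-matching sentence in match for every sentence in match2
best_idx = sim_mat.argmax(axis=1)

result = pd.DataFrame({
    'element_from': match2,
    'element_to': [match[j] for j in best_idx],
    'similarity': sim_mat.max(axis=1),
})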
Alternative solution that gives something similar:
# Cosine similarity between two single vectors
# (note: this shadows the cosine_similarity imported from sklearn above)
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2)))

# Compares the sentences in embed_mat (the first list) against each other
for i in range(embed_mat.shape[0]):
    for j in range(i + 1, embed_mat.shape[0]):
        print("The cosine similarity between the documents ", i, "and", j, "is: ",
              cosine_similarity(embed_mat[i], embed_mat[j]))
Output:
The cosine similarity between the documents sentence_from_1 and sentence_to_5 is 0.95424
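For reference, the manual formula agrees with sklearn's implementation; a small sanity check on two toy vectors (purely illustrative values):

from sklearn.metrics.pairwise import cosine_similarity as sk_cosine

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

print(cosine_similarity(a, b))     # manual scalar version defined above, ~0.9746
print(sk_cosine([a], [b])[0][0])   # same value from sklearn (expects 2-D inputs)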
I even managed to get the result by doing it this way:
embed_mat = np.array([x for x in predictions_1[e_col]])

to = []
fro = []
sim = []
for i in match2:
    fro.append(i)
    # Re-embed the current sentence and compare it against all embeddings
    embedding = pipe.predict(i).iloc[0][e_col]
    m = np.array([embedding,]*len(df))
    sim_mat = cosine_similarity(m, embed_mat)
    # Keep the best score and the corresponding sentence from the first list
    sim.append(max(sim_mat[0]))
    to.append(predictions_1['document'].values[sim_mat[0].argmax()])

pd.DataFrame({'From': fro, 'To': to, 'Similarity': sim})
But I think there are better ways to solve it, and by better I mean more optimized.
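One further idea worth considering (a sketch built on the embed_mat_to and embed_mat_from matrices assumed earlier, not a drop-in replacement): cosine similarity is just the dot product of L2-normalized vectors, so the embeddings can be normalized once and the full matrix obtained with a single matrix multiplication.

# Normalize each embedding to unit length once
norm_to = embed_mat_to / np.linalg.norm(embed_mat_to, axis=1, keepdims=True)
norm_from = embed_mat_from / np.linalg.norm(embed_mat_from, axis=1, keepdims=True)

# A single matrix product replaces all the per-sentence calls
sim_mat = norm_from @ norm_to.T   # shape: (len(match2), len(match))

From there, the same argmax construction as in the earlier sketch builds the final frame.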