How to find Sentence Similarity using deep learning?

I am trying to find sentence similarity through word embeddings and then applying a cosine similarity score. I tried the CBOW/Skip-gram methods for the embeddings, but they did not solve the problem.

I am doing this for product review data. I have two columns:

SNo    Product_Title                                         Customer_Review
 1     101.x battery works well with Samsung smart phone     I have an Apple phone and it's not that great.
 2     112.x battery works well with Samsung smart phone     I have samsung smart tv and I tell that it's not worth buying.
 3     112.x battery works well with Samsung smart phone.    This charger works very well with samsung phone. It is fast charging.

The first two reviews are irrelevant, as the semantic meanings of Product_Title and Customer_Review are completely different.

How can an algorithm capture the semantic meaning of these sentences and score them?

My Approach:

  1. Text pre-processing

  2. Train CBOW/Skip-gram using Gensim on my data-set

  3. Do sentence-level encoding by averaging all word vectors in that sentence

  4. Take the cosine similarity of product_title and reviews (a sketch of steps 2-4 follows this list).
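
For reference, a minimal sketch of steps 2-4, assuming Gensim 4.x and scikit-learn; the toy corpus and whitespace tokenization are placeholders for your own pre-processed data:

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: replace with your tokenized titles and reviews
corpus = [
    "battery works well with samsung smart phone".split(),
    "i have an apple phone and it is not that great".split(),
]

# sg=0 trains CBOW, sg=1 trains Skip-gram
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

def sentence_vector(tokens, model):
    # Average the vectors of all in-vocabulary words in the sentence
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

title_vec = sentence_vector(corpus[0], model)
review_vec = sentence_vector(corpus[1], model)
print(cosine_similarity([title_vec], [review_vec])[0][0])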

Problem: This was not able to capture the context of the sentences, and hence the results were very poor.

Approach 2:

I used pre-trained BERT without pre-processing the sentences. The results did not improve either.
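
The question doesn't say how the sentence-level BERT embeddings were obtained; for reference, a minimal sketch of this approach, assuming the sentence-transformers library (the model name is an example choice):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # example pre-trained model

title = "battery works well with Samsung smart phone"
review = "I have an Apple phone and it's not that great."

# Encode both sentences and score them with cosine similarity
embeddings = model.encode([title, review], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())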

1. Is there any other approach that would capture the context/semantics of sentences?

2. How can we train BERT on our data-set from scratch, without using a pre-trained model? (A rough sketch follows.)
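
For question 2, a minimal sketch of masked-language-model pretraining from scratch, assuming the Hugging Face transformers and datasets libraries; reviews.txt is a hypothetical file of raw review text, and the tokenizer vocabulary is reused from bert-base-uncased for brevity (you could also train your own tokenizer):

from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)  # randomly initialized, no pre-trained weights

# reviews.txt: one raw review per line (hypothetical file)
dataset = load_dataset('text', data_files={'train': 'reviews.txt'})['train']
dataset = dataset.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=128),
    batched=True, remove_columns=['text'])

# The collator randomly masks 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir='bert-from-scratch',
                         num_train_epochs=3, per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()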

Have you tried the Universal Sentence Encoder (USE), or the Multilingual Universal Sentence Encoder?

There's a colab showing how to score sentence pairs for semantic textual similarity with USE on the Semantic Textual Similarity Benchmark (STS-B), and another for multilingual similarity.

Here's a heatmap of pairwise semantic similarity scores from USE, from the Google AI blog post Advances in Semantic Textual Similarity. The model was trained on a large amount of web data, so it should work well for a wide variety of input data.

Pairwise semantic similarity comparison via outputs of the TensorFlow Hub Universal Sentence Encoder.

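For a quick start, a minimal sketch of scoring a sentence pair with USE, assuming the tensorflow and tensorflow_hub packages are installed:

import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "battery works well with Samsung smart phone",
    "This charger works very well with samsung phone. It is fast charging.",
]
vectors = embed(sentences).numpy()

# Cosine similarity between the two 512-dimensional sentence vectors
score = np.inner(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(score)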

Here is a very elaborate tutorial on how to perform sentence similarity analysis using any of the 50+ sentence embeddings in NLU, like BERT, USE, Electra, and many more! NLU supports over 50 languages and includes multilingual embeddings!
It takes around 5 lines to generate a similarity matrix with NLU, and you can use 3 or more sentence embeddings at the same time in just 1 line of code; all you need is:

nlu.load('embed_sentence.bert embed_sentence.electra use')

But let's keep it simple and say we want to calculate the similarity matrix for every sentence in our DataFrame.

You need the following 3 steps:

1. Calculate embeddings

import nlu

predictions = nlu.load('embed_sentence.bert').predict(your_dataframe)

2. Calculate the similarity matrix

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_sim_df_total(predictions, e_col):
  # Calculates the cosine similarity between every sentence pair and creates,
  # for every sentence, a new column named after the sentence it is compared to.
  # Stack the embeddings into a matrix
  embed_mat = np.array([x for x in predictions[e_col]])
  # Calculate the similarity between every embedding pair
  sim_mat = cosine_similarity(embed_mat, embed_mat)
  for i, row in enumerate(sim_mat):
    s = predictions.iloc[i].document
    predictions[s] = row
  return predictions

sim_matrix_df = get_sim_df_total(predictions, 'embed_sentence_bert_embeddings')
sim_matrix_df

3. Plot the heatmap of the similarity matrix

import matplotlib.pyplot as plt
import seaborn as sns

# Columns that hold metadata rather than similarity scores
non_sim_columns = ['text', 'document', 'Title', 'embed_sentence_bert_embeddings']

def viz_sim_matrix_first_n(num_sentences=20, sim_df=sim_matrix_df):
  # Plot a heatmap for the first num_sentences sentences
  fig, ax = plt.subplots(figsize=(20, 14))
  sim_df.index = sim_df.document
  sim_columns = [c for c in sim_df.columns if c not in non_sim_columns]
  ax = sns.heatmap(sim_df.iloc[:num_sentences][sim_columns[:num_sentences]])
  ax.axes.set_title(f"Similarity matrix for the first {num_sentences} sentences in the dataset")

viz_sim_matrix_first_n()


To learn more, check out these links :)

Article: https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf

Colab notebook for the sentence similarity demo with NLU: https://colab.research.google.com/drive/1LtOdtXtRJ3_N8kYywPd5k2AJMCGcgAdN?usp=sharing

NLU website: http://nlu.johnsnowlabs.com/
