How to find Sentence Similarity using deep learning?

I am trying to find sentence similarity through word embeddings and then applying a cosine similarity score. I tried the CBOW/Skip-gram methods for the embeddings, but they did not solve the problem.

I am doing this for product review data. I have two columns:

SNo    Product_Title                                         Customer_Review
 1     101.x battery works well with Samsung smart phone     I have an Apple phone and it's not that great.
 2     112.x battery works well with Samsung smart phone     I have samsung smart tv and I tell that it's not worth buying.
 3     112.x battery works well with Samsung smart phone.    This charger works very well with samsung phone. It is fast charging.

The first two reviews are irrelevant, as the semantic meanings of Product_Title and Customer_Review are completely different.

How can an algorithm capture the semantic meaning of these sentences and score them?

My Approach:

  1. Text pre-processing

  2. Train CBOW/Skip-gram using Gensim on my data-set

  3. Do sentence-level encoding by averaging all word vectors in that sentence

  4. Take the cosine similarity of product_title and reviews (a sketch of steps 2-4 follows this list).
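
For reference, a minimal sketch of steps 2-4, assuming Gensim 4.x and scikit-learn; the toy corpus and whitespace tokenization are placeholders for your own pre-processed data:

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: replace with your tokenized titles and reviews
corpus = [
    "battery works well with samsung smart phone".split(),
    "i have an apple phone and it is not that great".split(),
]

# sg=0 trains CBOW, sg=1 trains Skip-gram
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

def sentence_vector(tokens, model):
    # Average the vectors of all in-vocabulary words in the sentence
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

title_vec = sentence_vector(corpus[0], model)
review_vec = sentence_vector(corpus[1], model)
print(cosine_similarity([title_vec], [review_vec])[0][0])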

Problem: This was not able to capture the context of the sentences, and hence the results were very poor.

Approach 2:

I used pre-trained BERT without pre-processing the sentences. The results did not improve either.
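
The question doesn't say how the sentence-level BERT embeddings were obtained; for reference, a minimal sketch of this approach, assuming the sentence-transformers library (the model name is an example choice):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # example pre-trained model

title = "battery works well with Samsung smart phone"
review = "I have an Apple phone and it's not that great."

# Encode both sentences and score them with cosine similarity
embeddings = model.encode([title, review], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())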

1. Is there any other approach that would capture the context/semantics of sentences?

2. How can we train BERT on our data-set from scratch, without using a pre-trained model? (A rough sketch follows.)
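
For question 2, a minimal sketch of masked-language-model pretraining from scratch, assuming the Hugging Face transformers and datasets libraries; reviews.txt is a hypothetical file of raw review text, and the tokenizer vocabulary is reused from bert-base-uncased for brevity (you could also train your own tokenizer):

from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)  # randomly initialized, no pre-trained weights

# reviews.txt: one raw review per line (hypothetical file)
dataset = load_dataset('text', data_files={'train': 'reviews.txt'})['train']
dataset = dataset.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=128),
    batched=True, remove_columns=['text'])

# The collator randomly masks 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir='bert-from-scratch',
                         num_train_epochs=3, per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()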

Have you tried the Universal Sentence Encoder (USE), or the Multilingual Universal Sentence Encoder?

There's a colab showing how to score sentence pairs for semantic textual similarity with USE on the Semantic Textual Similarity Benchmark (STS-B), and another for multilingual similarity.

Here's a heatmap of pairwise semantic similarity scores from USE, from the Google AI blog post Advances in Semantic Textual Similarity. The model was trained on a large amount of web data, so it should work well for a wide variety of input data.

Pairwise semantic similarity comparison via outputs of the TensorFlow Hub Universal Sentence Encoder.

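For a quick start, a minimal sketch of scoring a sentence pair with USE, assuming the tensorflow and tensorflow_hub packages are installed:

import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "battery works well with Samsung smart phone",
    "This charger works very well with samsung phone. It is fast charging.",
]
vectors = embed(sentences).numpy()

# Cosine similarity between the two 512-dimensional sentence vectors
score = np.inner(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(score)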

Here is a very elaborate tutorial on how to perform sentence similarity analysis using any of the 50+ sentence embeddings in NLU, like BERT, USE, Electra, and many more! NLU supports over 50 languages and includes multilingual embeddings!
It takes around 5 lines to generate a similarity matrix with NLU, and you can use 3 or more sentence embeddings at the same time in just 1 line of code; all you need is:

nlu.load('embed_sentence.bert embed_sentence.electra use')

But let's keep it simple and say we want to calculate the similarity matrix for every sentence in our DataFrame.

You need the following 3 steps:

1. Calculate embeddings

import nlu

predictions = nlu.load('embed_sentence.bert').predict(your_dataframe)

2. Calculate the similarity matrix

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_sim_df_total(predictions, e_col):
  # Calculates the cosine similarity between every sentence pair and creates,
  # for every sentence, a new column named after the sentence it is compared to.
  # Stack the embeddings into a matrix
  embed_mat = np.array([x for x in predictions[e_col]])
  # Calculate the similarity between every embedding pair
  sim_mat = cosine_similarity(embed_mat, embed_mat)
  for i, row in enumerate(sim_mat):
    s = predictions.iloc[i].document
    predictions[s] = row
  return predictions

sim_matrix_df = get_sim_df_total(predictions, 'embed_sentence_bert_embeddings')
sim_matrix_df

3. Plot the heatmap of the similarity matrix

import matplotlib.pyplot as plt
import seaborn as sns

# Columns that hold metadata rather than similarity scores
non_sim_columns = ['text', 'document', 'Title', 'embed_sentence_bert_embeddings']

def viz_sim_matrix_first_n(num_sentences=20, sim_df=sim_matrix_df):
  # Plot a heatmap for the first num_sentences sentences
  fig, ax = plt.subplots(figsize=(20, 14))
  sim_df.index = sim_df.document
  sim_columns = [c for c in sim_df.columns if c not in non_sim_columns]
  ax = sns.heatmap(sim_df.iloc[:num_sentences][sim_columns[:num_sentences]])
  ax.axes.set_title(f"Similarity matrix for the first {num_sentences} sentences in the dataset")

viz_sim_matrix_first_n()


To learn more, check out these links :)

Article: https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf

Colab notebook for the sentence similarity demo with NLU: https://colab.research.google.com/drive/1LtOdtXtRJ3_N8kYywPd5k2AJMCGcgAdN?usp=sharing

NLU website: http://nlu.johnsnowlabs.com/
