(TF-IDF) How to return the five most related articles after calculating cosine similarity

Question

I have a DataFrame sample_df with 4 columns: paper_id, title, abstract, body_text. I extracted the abstract column (about 1000 words per abstract) and applied a text-cleaning step. Here's my question:

After calculating the cosine similarity between the question and the abstracts, how can I return the top 5 article scores together with the corresponding information (e.g. paper_id, title, body_text)? My goal is TF-IDF question answering.

I'm sorry that my English is poor, and I'm new to NLP. I would appreciate it if someone could help.

from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity  

txt_cleaned = get_cleaned_text(sample_df,sample_df['abstract'])
question = ['Can covid19 transmit through air']

tfidf_vector = TfidfVectorizer()

tfidf = tfidf_vector.fit_transform(txt_cleaned)

tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()

related_docs_indices = cosine_similarities.argsort()[:-5:-1]
cosine_similarities[related_docs_indices]

#output([0.18986527, 0.18339485, 0.14951123, 0.13441914])

Answer 1

First: if you want 5 articles, use [:-6:-1] instead of [:-5:-1]. The stop index of a slice is excluded, so with a negative step [:-5:-1] stops before position -5 and returns only 4 items, which is why your output shows four scores.

Or use [::-1][:5]: [::-1] reverses all the values, and then you can apply a normal [:5].
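To make the off-by-one visible, here is a small sketch on a made-up score array (the values in `scores` are invented for illustration and are not from the question's data). It compares the three slices discussed above.

```python
import numpy as np

# Hypothetical similarity scores, only for illustration
scores = np.array([0.10, 0.50, 0.30, 0.90, 0.20, 0.70, 0.40])
order = scores.argsort()        # indices from lowest to highest score

four = order[:-5:-1]            # stops before position -5, so only 4 indices
five = order[:-6:-1]            # stops before position -6, so 5 indices
also_five = order[::-1][:5]     # reverse everything, then take the first 5

print(len(four), len(five), len(also_five))
print(scores[five])             # the five highest scores, best first
```

The [::-1][:number] form is arguably easier to read because the count appears directly in the slice.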

Once you have related_docs_indices, you can use .iloc[] to get the matching rows from the DataFrame:

sample_df.iloc[related_docs_indices]

If several elements have the same similarity, this returns them in reversed order (in the example output further down, row 3 comes before row 0).
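Since the question asks for the scores together with paper_id, title and body_text, the lookup can be wrapped in a small helper. This is only a sketch: the function name `top_k_articles` and the demo data are made up, and it assumes the similarity array is aligned by position with the rows of the DataFrame. It uses a stable sort on the negated scores, which is a swapped-in alternative to reversing argsort, so that tied rows keep their original order.

```python
import numpy as np
import pandas as pd

def top_k_articles(df, similarities, k=5, columns=("paper_id", "title", "body_text")):
    """Return the k rows of df with the highest similarity, best first.

    Hypothetical helper: assumes similarities[i] belongs to df.iloc[i].
    """
    similarities = np.asarray(similarities)
    # Stable sort on negated scores: highest first, ties in original order
    top = np.argsort(-similarities, kind="stable")[:k]
    result = df.iloc[top][list(columns)].copy()
    result["similarity"] = similarities[top]
    return result

# Made-up demo data
demo_df = pd.DataFrame({
    "paper_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "abstract": ["a", "b", "c"],
    "body_text": ["text a", "text b", "text c"],
})
demo_scores = np.array([0.2, 0.9, 0.5])

top2 = top_k_articles(demo_df, demo_scores, k=2)
print(top2)
```

The helper returns a copy, so the original DataFrame is left without an extra column.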

BTW:

You can also add the similarities to the DataFrame:

sample_df['similarity'] = cosine_similarities

and then sort in descending order and take the first 5 rows:

sample_df.sort_values('similarity', ascending=False)[:5]

If several elements have the same similarity, this returns them in their original order.
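If the order of tied rows matters, it may be safer not to rely on the default sort algorithm. The sketch below uses made-up data and shows two options: passing kind="stable" to sort_values, and DataFrame.nlargest, whose default keep="first" prefers earlier rows among ties. Neither is part of the original code; they are suggested alternatives.

```python
import pandas as pd

# Made-up scores with a tie between paper 1 and paper 4
ranked_df = pd.DataFrame({
    "paper_id": [1, 2, 3, 4],
    "similarity": [0.6, 0.1, 0.0, 0.6],
})

# Option 1: explicit stable sort, then take the first rows
by_sort = ranked_df.sort_values("similarity", ascending=False, kind="stable").head(2)

# Option 2: nlargest picks the n rows with the highest values
by_nlargest = ranked_df.nlargest(2, "similarity")

print(by_sort)
print(by_nlargest)
```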

Minimal working code with some data, so everyone can copy and test it.

Because the DataFrame has only 5 elements, I search for the top 2.

from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd

sample_df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
    'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
    'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer', 'Hello covid19 again', 'Buy new air conditioner'],
})

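# Stand-in for the cleaning function from the question (not shown there); returns the text unchanged
def get_cleaned_text(df, column):
    return column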

txt_cleaned = get_cleaned_text(sample_df, sample_df['abstract'])
question = ['Can covid19 transmit through air']

tfidf_vector = TfidfVectorizer()

tfidf = tfidf_vector.fit_transform(txt_cleaned)

tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()

sample_df['similarity'] = cosine_similarities

number = 2
#related_docs_indices = cosine_similarities.argsort()[:-(number+1):-1]
related_docs_indices = cosine_similarities.argsort()[::-1][:number]

print('index:', related_docs_indices)

print('similarity:', cosine_similarities[related_docs_indices])

print('\n--- related_docs_indices ---\n')

print(sample_df.iloc[related_docs_indices])

print('\n--- sort_values ---\n')

print( sample_df.sort_values('similarity', ascending=False)[:number] )

Result:

index: [3 0]
similarity: [0.62791376 0.62791376]

--- related_docs_indices ---

   paper_id          title abstract            body_text  similarity
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
0         1        Covid19  covid19        Hello covid19    0.627914

--- sort_values ---

   paper_id          title abstract            body_text  similarity
0         1        Covid19  covid19        Hello covid19    0.627914
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
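Putting the steps together, the whole lookup could be wrapped in one function. This is a sketch under assumptions, not a definitive implementation: the name `rank_articles` and its parameters are made up, the vectorizer is refit on every call for simplicity, and dropping rows with a similarity of 0 is an extra choice that the code above does not make. It reuses the sample data from the minimal example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def rank_articles(df, question, k=5, text_column="abstract"):
    """Rank rows of df against a question by TF-IDF cosine similarity.

    Hypothetical helper; refits the vectorizer on each call.
    """
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(df[text_column])
    question_vec = vectorizer.transform([question])
    # TfidfVectorizer L2-normalises rows by default, so the dot product
    # computed by linear_kernel equals the cosine similarity.
    scores = linear_kernel(question_vec, doc_matrix).flatten()
    ranked = df.assign(similarity=scores)        # new frame, df is not modified
    ranked = ranked[ranked["similarity"] > 0]    # assumption: skip articles sharing no terms
    ranked = ranked.sort_values("similarity", ascending=False, kind="stable")
    return ranked.head(k)[["paper_id", "title", "body_text", "similarity"]]

papers = pd.DataFrame({
    "paper_id": [1, 2, 3, 4, 5],
    "title": ["Covid19", "Flu", "Cancer", "Covid19 Again", "New Air Conditioners"],
    "abstract": ["covid19", "flu", "cancer", "covid19", "air conditioner"],
    "body_text": ["Hello covid19", "Hello flu", "Hello cancer",
                  "Hello covid19 again", "Buy new air conditioner"],
})

top = rank_articles(papers, "Can covid19 transmit through air", k=5)
print(top)
```

Asking for 5 results on this data returns fewer rows, because the flu and cancer abstracts share no terms with the question. For repeated questions over the same articles, fitting the vectorizer once and reusing it would avoid redundant work.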

Answer 1 · 0 · accepted · 2020-08-17 13:59:22
