Matching phrases using TF-IDF and cosine similarity

Question

I have a dataframe that looks like this:

question                                answer
Why did the chicken cross the road?     to get to the other side
Who are you?                            a chatbot
Hello, how are you?                     Hi
.
.
.

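What I want to do is train TF-IDF on this dataset. When a user enters a phrase, the question that best matches it should be selected using cosine similarity. I can create TF-IDF values for the sentences in the training dataset this way, but how do I use them to compute cosine similarity scores for a new phrase entered by the user?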

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(intent_data["sentence"])

Answer 1

I think you need something like this:

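from sklearn.metrics.pairwise import cosine_similarity
# x and v are the fitted TF-IDF matrix and vectorizer from the question.
# transform() (not fit_transform()) maps the new phrase into the existing vocabulary.
cosine_similarities = cosine_similarity(x, v.transform(['user input'])).flatten()
# Position of the training sentence most similar to the user input
best_match_index = cosine_similarities.argmax()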
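
One caveat with `argmax()`: if the user input shares no words with any stored question, every similarity is 0 and `argmax()` still returns index 0. Below is a minimal end-to-end sketch that guards against this. The `find_answer` helper and the `min_similarity` cutoff of 0.1 are illustrative assumptions, not part of the original answer, and plain lists stand in for the dataframe columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample rows taken from the question's dataframe
questions = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]
answers = ["to get to the other side", "a chatbot", "Hi"]

# Fit once on the known questions; reuse the fitted vectorizer for every query
vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(questions)


def find_answer(user_input, min_similarity=0.1):
    """Return the answer of the closest question, or None if nothing is close.

    min_similarity is a hypothetical threshold; tune it for your data.
    """
    user_vector = vectorizer.transform([user_input])
    similarities = cosine_similarity(question_matrix, user_vector).flatten()
    best = similarities.argmax()
    if similarities[best] < min_similarity:
        return None  # no stored question overlaps enough with the input
    return answers[best]


print(find_answer("hello, you"))
print(find_answer("zzzz qqqq"))
```

The second call contains only words the vectorizer has never seen, so its TF-IDF vector is all zeros and the function returns `None` instead of silently picking the first row.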

Answer 2

Try this:

Input:

   question                             answer
0  Why did the chicken cross the road?  to get to the other side
1  Who are you?                         a chatbot
2  Hello, how are you?                  Hi

#Script

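import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# data = input dataframe as above
data = pd.DataFrame({
    "question": ["Why did the chicken cross the road?", "Who are you?", "Hello, how are you?"],
    "answer": ["to get to the other side", "a chatbot", "Hi"],
})
v = TfidfVectorizer()
sentence_input = ["hello, you"]
# Fit on the known questions, then map the user input into the same vocabulary
question_vectors = v.fit_transform(data["question"])
similarity_index_list = cosine_similarity(question_vectors, v.transform(sentence_input)).flatten()
# argmax() is a position, so use iloc; this stays correct for a non-default index
output = data["answer"].iloc[similarity_index_list.argmax()]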

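Suggestion: use a prediction-based word-embedding method (e.g. fastText, word2vec) to preserve context in the output vectors. This gives more accurate results when sentences are ambiguous.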
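
To illustrate why embeddings can help, here is a rough sketch that averages word vectors into a sentence vector and compares sentences by cosine similarity. The three-dimensional vectors in `TOY_VECTORS` are made up purely for illustration; in practice they would come from a trained fastText or word2vec model. The helper names are likewise hypothetical.

```python
import math

# Toy vectors invented for this example only; NOT real fastText/word2vec output.
TOY_VECTORS = {
    "hello": [0.9, 0.1, 0.0],
    "hi": [0.8, 0.2, 0.1],
    "chicken": [0.0, 0.9, 0.3],
    "road": [0.1, 0.7, 0.6],
    "who": [0.2, 0.1, 0.9],
    "you": [0.3, 0.2, 0.8],
}

QUESTIONS = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words; None if no word is known."""
    words = [w.strip("?,.!").lower() for w in sentence.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(user_input, questions, vectors):
    """Index of the most similar question, or None if the input has no known words."""
    query = sentence_vector(user_input, vectors)
    if query is None:
        return None
    scores = []
    for question in questions:
        qv = sentence_vector(question, vectors)
        scores.append(cosine(query, qv) if qv is not None else 0.0)
    return max(range(len(scores)), key=scores.__getitem__)


print(best_match("hi", QUESTIONS, TOY_VECTORS))
```

With these toy vectors, the input "hi" is matched to "Hello, how are you?" even though the two share no token, which is the case where plain TF-IDF would score every question at 0.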

Answer 1
1 accepted 2019-10-04 17:19:33

Answer 2
0 2019-10-04 17:51:08
