
Matching phrase using TF-IDF and cosine similarity

I have a dataframe that looks like this:

question                                answer
Why did the chicken cross the road?     to get to the other side
Who are you?                            a chatbot
Hello, how are you?                     Hi
.
.
.  

I'd like to fit TF-IDF on the questions in this dataset. When the user enters a phrase, the question most similar to that phrase should be chosen using cosine similarity. I can compute the TF-IDF values for the sentences in the training set as shown below, but how do I use them to get a cosine similarity score for a new phrase the user enters?

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(intent_data["sentence"])

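I think you need something like this: transform the user input with the same fitted vectorizer, then compare it against the matrix you already have.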

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(x, v.transform(['user input'])).flatten()
best_match_index = cosine_similarities.argmax()
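
For intuition about what those two calls compute, here is a rough sketch using only the standard library. It is a simplification and not scikit-learn's implementation: the tokenizer is a plain regex, the helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`) are made up for illustration, and the scores may not match `TfidfVectorizer` exactly. The IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of word characters (a simplification).
    return re.findall(r"\w+", text.lower())

def build_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    # Terms never seen in the training questions are dropped.
    counts = Counter(tokenize(text))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}

def cosine(a, b):
    # Dot product over shared terms, divided by the two vector lengths.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

questions = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]
idf = build_idf(questions)                                   # the "fit" step
question_vectors = [tfidf_vector(q, idf) for q in questions]

user_vector = tfidf_vector("hello, you", idf)                # the "transform" step
scores = [cosine(qv, user_vector) for qv in question_vectors]
best_match_index = max(range(len(scores)), key=scores.__getitem__)
```

The key point is that the user input is weighted with the IDF values learned from the questions, which is why the answer above calls `v.transform` rather than fitting a second vectorizer.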

Try this:

Input:

   question                             answer
0  Why did the chicken cross the road?  to get to the other side
1  Who are you?                         a chatbot
2  Hello, how are you?                  Hi

Script:

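import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Input dataframe as above
data = pd.DataFrame({
    "question": ["Why did the chicken cross the road?", "Who are you?", "Hello, how are you?"],
    "answer": ["to get to the other side", "a chatbot", "Hi"],
})

v = TfidfVectorizer()
question_matrix = v.fit_transform(data["question"])  # fit on the known questions
sentence_input = ["hello, you"]
# Score the input against every question, using the same fitted vectorizer
similarity_index_list = cosine_similarity(question_matrix, v.transform(sentence_input)).flatten()
# argmax returns a position, so index with iloc rather than loc
output = data["answer"].iloc[similarity_index_list.argmax()]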
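
One caveat with `argmax`: it always returns some index, even when the input shares no words with any question (all scores are 0, so you get the first row). A possible way to guard against that is a minimum score, sketched below. The function name `best_answer` and the `0.3` cutoff are my own assumptions for illustration, not recommended values, so tune the threshold on your data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]
answers = ["to get to the other side", "a chatbot", "Hi"]

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(questions)  # fit once, reuse for every query

def best_answer(user_input, min_score=0.3):
    # min_score is an arbitrary illustrative cutoff (assumption).
    query_vector = vectorizer.transform([user_input])
    scores = cosine_similarity(question_matrix, query_vector).ravel()
    best = int(scores.argmax())
    if scores[best] < min_score:
        return None  # nothing similar enough
    return answers[best]
```

Calling `best_answer("hello, you")` picks the third question's answer, while an input made only of unseen words returns `None` instead of silently falling back to row 0.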

Suggestion: use a prediction-based word embedding approach (e.g. fastText or word2vec) so that the vectors capture context. This should give more accurate results for ambiguous sentences.
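
To illustrate the idea only: one common way to use word embeddings here is to average the word vectors of a sentence and compare the averages with cosine similarity. The sketch below uses a tiny hand-written `toy_vectors` table that is entirely made up; in practice those vectors would come from a trained word2vec or fastText model. The names `toy_vectors` and `sentence_vector` are hypothetical.

```python
import math

# Made-up 3-dimensional vectors purely for illustration (assumption);
# a real model would supply learned vectors with many more dimensions.
toy_vectors = {
    "hello": [0.9, 0.1, 0.0],
    "hi": [0.8, 0.2, 0.0],
    "who": [0.0, 0.9, 0.1],
    "you": [0.1, 0.8, 0.2],
    "chicken": [0.0, 0.1, 0.9],
    "road": [0.1, 0.0, 0.8],
}

def sentence_vector(text):
    # Average the vectors of the known words; unknown words are skipped.
    words = [w.strip("?,.!") for w in text.lower().split()]
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    if not known:
        return None
    return [sum(component) / len(known) for component in zip(*known)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

questions = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]
question_vectors = [sentence_vector(q) for q in questions]

# "hi" appears in none of the questions, so TF-IDF would score it 0 everywhere.
user_vector = sentence_vector("hi")
scores = [cosine(qv, user_vector) for qv in question_vectors]
best_match_index = max(range(len(scores)), key=scores.__getitem__)
```

With these toy vectors, "hi" lands closest to "Hello, how are you?" even though the two share no words, which is the kind of match plain TF-IDF cannot make.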
