
Search engine with TF-IDF in Python

Here is my code:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "this is first document",
        "this is second document",
        "this is third",
        "which document is first",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    X.toarray()

Now, this is what I want to do:

When I search for "document", it should give me documents (sentences) [1, 2, 4].

When I search for "first document", it should give me document (sentence) [1].

When I search for "second", it should give me document (sentence) [2].

I want to do this with TF-IDF (I can't do normal searching).

How can I do that?

First, ask yourself what the TfidfVectorizer does: it transforms your documents into vectors. To search, transform your query into a vector as well, using the same fitted vectorizer (call transform, not fit_transform, so the query is mapped onto the vocabulary learned from the documents). Then compute the cosine similarity between the query vector and each document vector. The document with the highest cosine similarity is the most relevant one, at least according to the vector space model; a document with a similarity of zero shares no terms with the query. An example implementation is available at https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089.
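The approach described above could look roughly like the following. This is only a minimal sketch built on the corpus from the question: the helper name `search` is made up for illustration, and returning every document with a non-zero similarity (numbered from 1, best match first) is an assumption about what a "hit" should mean, not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "this is first document",
    "this is second document",
    "this is third",
    "which document is first",
]

# Fit the vectorizer on the documents once; X holds one TF-IDF row per document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)


def search(query):
    """Hypothetical helper: return 1-based document numbers, best match first."""
    # Reuse the fitted vocabulary; terms unknown to the corpus are ignored.
    query_vec = vectorizer.transform([query])
    # One cosine similarity score per document.
    scores = cosine_similarity(query_vec, X).ravel()
    # Sort by descending score and keep only documents sharing a term with the query.
    ranked = np.argsort(-scores)
    return [int(i) + 1 for i in ranked if scores[i] > 0]


print(search("document"))
print(search("first document"))
print(search("second"))
```

With this sketch, "document" matches documents 1, 2 and 4, and "second" matches only document 2. For "first document", every document containing either word gets a non-zero score, with document 1 ranked first; if you only want the single best match, take the first element of the result, or apply a score threshold of your choosing.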
