Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?

My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.

So far I have calculated the tf-idf of the documents as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)

My problem is this: now that I have the tf-idf of the documents, what operations do I perform on the query so that I can find its cosine similarity to the documents?

Here is my suggestion:

  • We don't have to fit the model twice. We can reuse the same vectorizer for the documents and the query.
  • The text cleaning function can be plugged directly into TfidfVectorizer using the preprocessor parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: fitted TfidfVectorizer model
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
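
As a rough usage sketch of the function above, the snippet below fits the vectorizer once on a small corpus and ranks the documents for a single query. The sample documents and the query are made up for illustration, and the default preprocessing is used because `nlp.clean_tf_idf_text` is the asker's own helper and is not available here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample corpus; replace with your own documents.
sample_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock prices fell sharply today",
]

# Fit once on the documents, then reuse the fitted vectorizer for every query.
sample_vectorizer = TfidfVectorizer()
sample_docs_tfidf = sample_vectorizer.fit_transform(sample_docs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    # Same function as above, repeated so the snippet runs on its own.
    query_tfidf = vectorizer.transform([query])
    return cosine_similarity(query_tfidf, docs_tfidf).flatten()

scores = get_tf_idf_query_similarity(
    sample_vectorizer, sample_docs_tfidf, "cat on a mat"
)

# Document indices sorted from most to least similar.
ranking = scores.argsort()[::-1]
for idx in ranking:
    print(f"{scores[idx]:.3f}  {sample_docs[idx]}")
```

Note that "cats" and "cat" are different terms for the default tokenizer, so only the first document shares vocabulary with this query.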

Cosine similarity is the cosine of the angle between the vectors that represent the documents:

K(X, Y) = <X, Y> / (||X||*||Y||)
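
To make the formula concrete, here is a small sketch that evaluates it by hand with NumPy and compares the result with sklearn's `cosine_similarity`. The two vectors are arbitrary illustrative values, not tf-idf output.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two arbitrary example vectors (illustrative values only).
x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 1.0, 1.0])

# K(X, Y) = <X, Y> / (||X|| * ||Y||)
manual = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# cosine_similarity expects 2-D inputs, hence the reshape to (1, n_features).
library = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))[0, 0]

print(manual, library)
```

Both values should agree up to floating-point rounding.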

Your tf-idf matrix will be a sparse matrix of shape (no. of documents) × (no. of distinct words).

To print the whole matrix, you can convert it to a dense one with todense():

print(tfidf_matrix.todense())

Each row is the vector representation of one document. Likewise, each column holds the tf-idf scores of one unique word in the corpus.

The pairwise similarity between a reference vector and every row of your tf-idf matrix can be calculated as:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(reference_vector, tfidf_matrix) 

The output will be an array with one entry per document (shape 1 × no. of documents), giving the similarity score between your reference vector and the vector of each document. If the reference vector is itself a row of the matrix, its similarity with itself will be 1. Because tf-idf weights are non-negative, every score lies between 0 and 1.
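
A minimal sketch of these properties, using a made-up three-document corpus and the first document as the reference vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus; the third document shares no words with the first.
corpus = ["red green blue", "green blue yellow", "purple orange"]
tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# Slice (rather than index) so the reference stays a 2-D, single-row matrix.
reference_vector = tfidf_matrix[0:1]

sims = cosine_similarity(reference_vector, tfidf_matrix)
print(sims.shape)    # one row, one column per document
print(sims.ravel())  # flattened to a 1-D array of scores
```

The first score is the reference document compared with itself, and the last is a comparison with a document that has no words in common.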

To find the similarity between the first and second documents:

print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))

array([[0.36651513]])

You can do as Nihal has written in his response, or you can use the nearest neighbors algorithm from sklearn. You have to select the proper metric (cosine):

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
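
The two lines above only construct the estimator. A fuller sketch, assuming a small invented corpus and query, could fit it on the document tf-idf matrix and then look up the closest documents for a transformed query:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented corpus and query, for illustration only.
nn_docs = [
    "python list comprehension",
    "python dictionary methods",
    "baking sourdough bread",
    "bread flour hydration",
]
nn_vectorizer = TfidfVectorizer()
nn_docs_tfidf = nn_vectorizer.fit_transform(nn_docs)

# n_neighbors must not exceed the number of documents.
neigh = NearestNeighbors(n_neighbors=2, metric="cosine")
neigh.fit(nn_docs_tfidf)

# The query must be transformed with the same fitted vectorizer.
query_tfidf = nn_vectorizer.transform(["sourdough bread recipe"])
distances, indices = neigh.kneighbors(query_tfidf)

# Cosine distance is 1 - cosine similarity.
similarities = 1 - distances

for idx, sim in zip(indices[0], similarities[0]):
    print(f"{sim:.3f}  {nn_docs[idx]}")
```

Neighbors are returned in order of increasing distance, so the first index is the most similar document.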

The other answers were very helpful, but they were not entirely what I was looking for, as they didn't show me how to transform my query so that I could compare it with the documents.

To transform the query, I first fit a vectorizer on the documents so that it uses the same vocabulary and idf weights:

queryTFIDF = TfidfVectorizer().fit(allDocs)

I then use it to transform the query into a tf-idf vector with the same columns as the document matrix:

queryTFIDF = queryTFIDF.transform([query])

Then I just calculate the cosine similarity between all the documents and my query using the sklearn.metrics.pairwise.cosine_similarity function:

cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()

I realise that with Nihal's solution I could have added my query as one of the documents and then calculated the similarity between it and the other documents, but this is what worked best for me.

The full code ends up looking like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_idf_query_similarity(documents, query):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    docTFIDF = TfidfVectorizer().fit_transform(allDocs)
    queryTFIDF = TfidfVectorizer().fit(allDocs)
    queryTFIDF = queryTFIDF.transform([query])

    cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()
    return cosineSimilarities
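
For the original goal of comparing 3 queries against 5 documents, one possible sketch is to transform all queries at once and compare the resulting matrix with the document matrix. The documents and queries below are invented, a single vectorizer is used, and taking the mean similarity per query is only one assumed way to decide which query is "most similar" to the set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented documents and queries, for illustration only.
documents = [
    "neural networks learn from training data",
    "training deep neural networks requires gpus",
    "gradient descent optimizes network weights",
    "overfitting hurts generalization on test data",
    "regularization reduces overfitting in training",
]
queries = [
    "training neural networks",
    "cooking pasta at home",
    "weather forecast for tomorrow",
]

vectorizer = TfidfVectorizer()
doc_tfidf = vectorizer.fit_transform(documents)  # fit on the documents only
query_tfidf = vectorizer.transform(queries)      # reuse the fitted vocabulary

# Row i holds the similarity of query i to each of the documents.
similarity = cosine_similarity(query_tfidf, doc_tfidf)

# Assumption: score each query by its mean similarity over all documents.
mean_per_query = similarity.mean(axis=1)
best_query = queries[mean_per_query.argmax()]
print(best_query)
```

Queries whose words never occur in the documents become all-zero vectors and get a similarity of 0 to every document.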
