
Sklearn and gensim's TF-IDF implementations

I've been trying to determine the similarity between a set of documents, and one of the methods I'm using is cosine similarity on the TF-IDF vectors.

I tried to use both sklearn and gensim's implementations, which give me similar results, but my own implementation results in a different matrix.

After analyzing, I noticed that their implementations differ from the ones I had studied and come across:

Sklearn and gensim use raw counts as the TF, and apply L2 norm on the resulting vectors.

On the other hand, the implementations I found normalize the term count, for example:

TF = term count / sum of all term counts in the document
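That length-normalized TF can be sketched in a few lines of plain Python (the tokenized example document is made up for illustration):

```python
from collections import Counter

def term_frequencies(doc_tokens):
    """Normalize raw term counts by the total number of tokens in the document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

tf = term_frequencies(["the", "cat", "sat", "on", "the", "mat"])
# "the" occurs 2 times out of 6 tokens, so tf["the"] == 2/6
```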

My question is: what is the difference between these implementations? Does one give better results in the end, for clustering or other purposes?

EDIT (to make the question clearer): What is the difference between normalizing the end result and normalizing the term counts at the beginning?

I ended up understanding why the normalization is done at the end of the tf-idf calculations instead of doing it on the term frequencies.

After searching around, I found that they use L2 normalization in order to simplify cosine similarity calculations.

So, instead of using the formula dot(vector1, vector2) / (norm(vector1) * norm(vector2)) to get the similarity between two vectors, we can directly use the result of the fit_transform function: tfidf * tfidf.T, with no extra normalization, since each vector already has norm 1.
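A quick check with scikit-learn confirms this (the three toy documents are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs sat together",
]

vectorizer = TfidfVectorizer()  # norm='l2' is the default
tfidf = vectorizer.fit_transform(docs)

# Because every row already has unit L2 norm, a plain matrix product
# with the transpose IS the pairwise cosine similarity matrix.
sims = (tfidf * tfidf.T).toarray()

print(np.allclose(sims, cosine_similarity(tfidf)))  # True
```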

I also tried normalizing the term frequencies first, but once the whole vectors are normalized at the end it produces exactly the same results, so it ends up being wasted work.
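That makes sense mathematically: dividing a row by the document length only rescales it, and the final L2 normalization cancels any per-row scaling. A small sketch (the documents are made up for illustration):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog sat on the log"]
counts = CountVectorizer().fit_transform(docs)

# Variant A: raw counts -> idf weighting -> L2 norm (the sklearn default).
tfidf_a = TfidfTransformer(norm="l2").fit_transform(counts)

# Variant B: length-normalized term frequencies -> idf weighting -> L2 norm.
row_sums = np.asarray(counts.sum(axis=1)).ravel()
tf = sp.diags(1.0 / row_sums) @ counts  # divide each row by its token count
tfidf_b = TfidfTransformer(norm="l2").fit(counts).transform(tf)

print(np.allclose(tfidf_a.toarray(), tfidf_b.toarray()))  # True
```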

With scikit-learn, you can set the normalization as desired when constructing TfidfTransformer() by setting norm to 'l1', 'l2', or None.

If you try this with None, you may get results similar to your own hand-rolled tf-idf implementation.
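For example, comparing the row norms under the two settings (the two toy documents are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat on the mat"]

# Default: every row is L2-normalized to unit length.
v_l2 = TfidfVectorizer(norm="l2").fit_transform(docs)
# norm=None leaves the raw tf-idf weights untouched.
v_raw = TfidfVectorizer(norm=None).fit_transform(docs)

l2_norms = np.sqrt(np.asarray(v_l2.multiply(v_l2).sum(axis=1))).ravel()
raw_norms = np.sqrt(np.asarray(v_raw.multiply(v_raw).sum(axis=1))).ravel()

print(l2_norms)   # all 1.0
print(raw_norms)  # generally not 1.0
```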

The normalization is typically used to reduce the effect of document length on a particular tf-idf weighting, so that words appearing in short documents are treated on a more equal footing with words appearing in much longer documents.
