I want to make sure that TfidfVectorizer returns L2-normalized vectors. I am working on a binary classification problem with documents of varying length.
I am trying to check the normalization of each document, so I assumed I could just sum each row of the TfidfVectorizer matrix. However, the sums are greater than 1; I thought normalization would scale every document to the range 0 to 1.
vect = TfidfVectorizer(strip_accents='unicode', stop_words=stopwords, analyzer='word',
                       use_idf=True, tokenizer=tokenizer, ngram_range=(1, 2),
                       sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum norm l2 documents
vect_sum = tfidf.sum(axis=1)
The values of vect_sum are greater than 1, although I expected norm='l2' to keep every vector between 0 and 1. I was just made aware of a preprocessing class in scikit-learn, preprocessing.Normalizer. Is that something I should use in the pipeline passed to GridSearchCV? See the example below.
pipeline = Pipeline([
('plb', normalize(tfidf, norm='l2')), #<-- sklearn.preprocessing
('tfidf', tfidf_vectorizer),
('clf', MultinomialNB()),
])
What is the difference between preprocessing.Normalizer and the norm parameter of TfidfVectorizer?
With the L2 norm, it is not the sum of each row that equals 1, but the sum of the squares of each row. The L1 norm is the one that makes the sum of the absolute values of each row equal 1; since TF-IDF weights are non-negative, that is simply the row sum.
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = [" This is my first sentence", "Short sentence"]
vect = TfidfVectorizer(strip_accents='unicode', analyzer='word', use_idf=True,
                       ngram_range=(1, 2), sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum of the squared entries of each document (row)
vect_sum = tfidf.multiply(tfidf).sum(axis=1)
vect_sum
# matrix([[ 1.],
# [ 1.]])
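To make the L1 versus L2 distinction concrete, here is a minimal sketch that compares the row sums under both norms. It uses the default TfidfVectorizer settings rather than the options from the question, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This is my first sentence", "Short sentence"]

# Same corpus, two different normalizations.
l2 = TfidfVectorizer(norm='l2').fit_transform(docs)
l1 = TfidfVectorizer(norm='l1').fit_transform(docs)

# Convert the per-row results to flat arrays for easy comparison.
l2_row_sums = np.asarray(l2.sum(axis=1)).ravel()             # plain sums, L2-normalized rows
l2_sq_sums = np.asarray(l2.multiply(l2).sum(axis=1)).ravel()  # sums of squares, L2-normalized rows
l1_row_sums = np.asarray(l1.sum(axis=1)).ravel()             # plain sums, L1-normalized rows

print(l2_row_sums)  # each value is above 1 for these documents
print(l2_sq_sums)   # each value is 1 (up to floating point error)
print(l1_row_sums)  # each value is 1 (up to floating point error)
```

Individual entries still lie between 0 and 1 under either norm; it is only the plain row sum of an L2-normalized vector that can exceed 1.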
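TF-IDF weighting is computed from raw term counts, so normalization has to happen after the weights are produced, not before. You can achieve the same effect as the norm parameter by calling normalize on the TF-IDF matrix afterwards.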
from sklearn.preprocessing import normalize

vect = TfidfVectorizer(strip_accents='unicode', analyzer='word', use_idf=True,
                       ngram_range=(1, 2), sublinear_tf=True, norm=None)
tfidf = vect.fit_transform(X_train)
tfidf = normalize(tfidf)  # normalize defaults to norm='l2'
This is equivalent to TfidfVectorizer(..., norm='l2') in the original example.
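As a rough check of that equivalence, and as one possible way to arrange the pipeline from the question, consider the sketch below. The labels are made up for illustration, and the step names ('tfidf', 'norm', 'clf') are arbitrary. Note that a pipeline step has to be a transformer object such as Normalizer, not the result of calling normalize, and that it has to come after the vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, normalize

docs = ["This is my first sentence", "Short sentence"]
labels = [0, 1]  # hypothetical labels, only here so the classifier can be fitted

# Built-in normalization versus normalizing the unnormalized weights by hand.
built_in = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True,
                           norm='l2').fit_transform(docs)
raw = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True,
                      norm=None).fit_transform(docs)
manual = normalize(raw, norm='l2')
same = np.allclose(built_in.toarray(), manual.toarray())
print(same)

# Pipeline variant: vectorize first, then normalize, then classify.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(norm=None)),
    ('norm', Normalizer(norm='l2')),
    ('clf', MultinomialNB()),
])
pipeline.fit(docs, labels)
predictions = pipeline.predict(docs)
```

Since the two approaches give the same matrix, the separate Normalizer step is redundant whenever the vectorizer already uses norm='l2'. If the goal is to compare norms in a grid search, it is probably simpler to drop that step and tune the vectorizer directly, for example with a parameter grid entry such as 'tfidf__norm': ['l1', 'l2'].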