TfidfVectorizer - L2 normalized vector
I want to ensure that the TfidfVectorizer object is returning an l2-normalized vector. I am running a binary classification problem with documents of varied length.

I am trying to extract the normalized vector of each document, so I assumed I could just sum up each row of the TfidfVectorizer matrix. However, the sum is greater than 1; I thought a normalized corpus would transform all documents to values in the 0-1 range.
vect = TfidfVectorizer(strip_accents='unicode', stop_words=stopwords,
                       analyzer='word', use_idf=True, tokenizer=tokenizer,
                       ngram_range=(1, 2), sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum the rows of the l2-normalized matrix
vect_sum = tfidf.sum(axis=1)
The values of vect_sum are greater than 1; I thought using norm would result in all vectors being between 0 and 1. I was just made aware of a preprocessing object in scikit-learn, preprocessing.Normalizer. Is that something I should use in the pipeline for grid search? See the example below.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('tfidf', tfidf_vectorizer),
    ('plb', Normalizer(norm='l2')),  # <-- sklearn.preprocessing
    ('clf', MultinomialNB()),
])
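For reference, here is roughly how I expected to wire this into grid search (a sketch only; the parameter grid values below are placeholders, not tuned choices):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],  # placeholder values
    'clf__alpha': [0.1, 1.0],                # placeholder values
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
# grid_search.fit(X_train, y_train)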
What is the difference between preprocessing.Normalizer and the TfidfVectorizer norm parameter?
With L2, it is not the sum of the row's values that equals 1, but the sum of their squares. The L1 norm produces a vector whose absolute values sum to 1 (for non-negative tf-idf weights, simply the sum of the values).
X_train = [" This is my first sentence", "Short sentence"]
vect = TfidfVectorizer(strip_accents='unicode', analyzer='word', use_idf=True,
                       ngram_range=(1, 2), sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum of squared entries per row; equals 1 under the l2 norm
vect_sum = tfidf.multiply(tfidf).sum(axis=1)
vect_sum
# matrix([[ 1.],
# [ 1.]])
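To check the L1 side of that claim (a quick sketch of my own, reusing the same X_train): with norm='l1' the plain row sums equal 1, because all tf-idf weights are non-negative.

vect_l1 = TfidfVectorizer(strip_accents='unicode', analyzer='word', use_idf=True,
                          ngram_range=(1, 2), sublinear_tf=True, norm='l1')
tfidf_l1 = vect_l1.fit_transform(X_train)
tfidf_l1.sum(axis=1)
# matrix([[ 1.],
#         [ 1.]])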
TF-IDF only applies to the counts. You could achieve the same effect if you perform the normalization after the TF-IDF weights are produced.
from sklearn.preprocessing import normalize

vect = TfidfVectorizer(strip_accents='unicode', analyzer='word', use_idf=True,
                       ngram_range=(1, 2), sublinear_tf=True, norm=None)
tfidf = vect.fit_transform(X_train)
tfidf = normalize(tfidf)  # l2 is the default norm
This would be equivalent to TfidfVectorizer(..., norm='l2') in the original example.
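As a quick sanity check (my own addition, not part of the original answer), the two approaches can be compared on the toy X_train above:

import numpy as np

vect_l2 = TfidfVectorizer(strip_accents='unicode', analyzer='word', use_idf=True,
                          ngram_range=(1, 2), sublinear_tf=True, norm='l2')
built_in = vect_l2.fit_transform(X_train)
np.allclose(built_in.toarray(), tfidf.toarray())  # True: identical up to float precision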