scikit-learn vectorizer vocabulary with multiple terms mapped to the same index

Question

scikit-learn's TfidfVectorizer correctly maps vocabulary terms with the same dictionary value to the same index. However, it creates as many columns in the output as there are entries in the vocabulary dictionary. Is there a better way to get around this than stripping off the extra columns after the transformation? That is, in the example below, I don't want the third column because it will always be zero.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(vocabulary={'surgery': 0, 'sx': 0, 'radiology': 1})
text = ['i had surgery', 'patient sx went well', 'radiology department']
vectorizer.fit(text)
vectorizer.transform(text).todense()

>>> matrix([[ 1.,  0.,  0.],
            [ 1.,  0.,  0.],
            [ 0.,  1.,  0.]])

Answer 1

sklearn.feature_selection.VarianceThreshold (scikit-learn >= 0.15) removes all-zero features (and constant features more generally).

>>> import numpy as np
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0]])
>>> VarianceThreshold().fit_transform(X)
array([[1, 0],
       [1, 0],
       [0, 1]])
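
The same selector can be applied directly to the sparse output of a vectorizer. The following is a minimal sketch, not a definitive implementation. It assumes a hypothetical vocabulary with unique indices in which 'oncology' never occurs in the text, so its column is all zeros; unique indices are used because some scikit-learn versions reject a vocabulary with repeated indices. The names index_to_term and kept_terms are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold

# Hypothetical vocabulary (assumption): unique indices, and 'oncology'
# never appears in the text, so column 2 is all zeros.
vocabulary = {'surgery': 0, 'radiology': 1, 'oncology': 2}
text = ['i had surgery', 'patient sx went well', 'radiology department']

vectorizer = TfidfVectorizer(vocabulary=vocabulary)
X = vectorizer.fit_transform(text)  # sparse matrix, one column per vocabulary entry

# The default threshold of 0.0 drops constant columns, including all-zero ones.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)  # sparse input is accepted

# Recover which vocabulary terms survived the selection.
kept = selector.get_support(indices=True)
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
kept_terms = [index_to_term[i] for i in kept]

print(X_reduced.shape)
print(kept_terms)
```

Keeping the fitted selector around means the same columns can be dropped from new data with selector.transform, rather than recomputing which columns are empty.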

Answer 1: accepted 2014-10-03 09:15:08
