scikit-learn vectorizer vocabulary with multiple terms mapping to same index

scikit-learn's TfidfVectorizer correctly maps vocabulary terms that share a dictionary value to the same index. However, it creates as many columns in the output as there are entries in the vocabulary dictionary, not as many as there are distinct indices. Is there a better way around this than stripping off the extra columns after the transformation? In the example below, I don't want the third column, because it will always be zero.

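from sklearn.feature_extraction.text import TfidfVectorizer

# 'surgery' and 'sx' deliberately share index 0.
# Note: newer scikit-learn releases raise ValueError for a vocabulary with
# repeated indices, so this snippet reflects the older behavior shown below.
vectorizer = TfidfVectorizer(vocabulary={'surgery': 0, 'sx': 0, 'radiology': 1})
text = ['i had surgery', 'patient sx went well', 'radiology department']
vectorizer.fit(text)
vectorizer.transform(text).todense()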

>>> matrix([[ 1.,  0.,  0.],
            [ 1.,  0.,  0.],
            [ 0.,  1.,  0.]])
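
For reference, the workaround mentioned in the question (stripping the surplus columns after the transformation) might look roughly like the sketch below. It works on plain Python lists rather than on the vectorizer's sparse output, and it assumes, as in the example, that the used indices run contiguously from 0 so the surplus columns are the trailing ones. `strip_unused_columns` is a hypothetical helper name, not part of scikit-learn.

```python
def strip_unused_columns(rows, vocabulary):
    """Keep only the columns whose index actually appears in the vocabulary.

    Assumes indices run contiguously from 0, so unused columns are trailing.
    """
    n_used = max(vocabulary.values()) + 1
    return [row[:n_used] for row in rows]


vocabulary = {'surgery': 0, 'sx': 0, 'radiology': 1}
# Dense output copied from the example above (3 columns, last one always zero).
dense = [[1.0, 0.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]

trimmed = strip_unused_columns(dense, vocabulary)
print(trimmed)  # [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

On the sparse matrix returned by `transform`, the equivalent trim would be a column slice such as `X[:, :n_used]`.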

A sklearn.feature_selection.VarianceThreshold (scikit-learn >= 0.15) will remove all-zero features (and constant features more generally).

>>> import numpy as np
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0]])
>>> VarianceThreshold().fit_transform(X)
array([[1, 0],
       [1, 0],
       [0, 1]])
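
To make the selection rule concrete, here is a minimal pure-Python sketch of the idea: compute each column's population variance and keep only the columns whose variance exceeds the threshold (0 by default). This is an illustration, not scikit-learn's implementation, and `drop_constant_columns` is a hypothetical name.

```python
from statistics import pvariance


def drop_constant_columns(rows, threshold=0.0):
    """Return rows with every column whose variance is <= threshold removed."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows]


X = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
print(drop_constant_columns(X))  # [[1, 0], [1, 0], [0, 1]]
```

Note that the decision depends on the data used for fitting: a term that never occurs in the training documents produces a zero-variance column and is dropped along with the surplus one.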
