How do I properly combine numerical features with text (bag of words) in scikit-learn?
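I am writing a classifier for web pages, so I have a mixture of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up looking like this: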
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
numerical_features = [
[1, 0],
[1, 1],
[0, 0],
[0, 1]
]
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one',
'Is this the first document?',
]
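# Count word occurrences; fit_transform returns a sparse matrix
bag_of_words_vectorizer = CountVectorizer(min_df=1)
X = bag_of_words_vectorizer.fit_transform(corpus)
words_counts = X.toarray()
# Reweight the raw counts with tf-idf
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(words_counts)
# get_feature_names() was removed in scikit-learn 1.2
bag_of_words_vectorizer.get_feature_names_out()
# Stack the 2 numerical columns next to the 9 tf-idf columns
combinedFeatures = np.hstack([numerical_features, tfidf.toarray()])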
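This works, but I'm concerned about the accuracy. Notice that there are 4 objects and only two numerical features. Even the simplest text results in a vector with nine features (because there are nine distinct words in the corpus). Obviously, with real text there will be hundreds or thousands of distinct words, so the final feature vector would have fewer than 10 numerical features but on the order of 1000 word-based ones.
Because of this, won't the classifier (SVM) weight the words much more heavily than the numerical features, by a factor of about 100 to 1? If so, how can I compensate so that the bag of words is weighted equally against the numerical features?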
You can weight the counts using tf-idf:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
np.set_printoptions(linewidth=200)
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one',
'Is this the first document?',
]
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print(words)
words_counts = X.toarray()
print(words_counts)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(words_counts)
print(tfidf.toarray())
The output looks like this:
# words
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# words_counts
[[0 1 1 1 0 0 1 0 1]
[0 1 0 1 0 2 1 0 1]
[1 0 0 0 1 0 1 1 0]
[0 1 1 1 0 0 1 0 1]]
# tfidf transformation
[[ 0. 0.43877674 0.54197657 0.43877674 0. 0. 0.35872874 0. 0.43877674]
[ 0. 0.27230147 0. 0.27230147 0. 0.85322574 0.22262429 0. 0.27230147]
[ 0.55280532 0. 0. 0. 0.55280532 0. 0.28847675 0.55280532 0. ]
[ 0. 0.43877674 0.54197657 0.43877674 0. 0. 0.35872874 0. 0.43877674]]
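One property of the matrix above bears on the weighting question. With its default settings (`norm='l2'`), `TfidfTransformer` normalizes each row to unit length, so a document's text block has the same overall magnitude no matter how many distinct words the corpus contains. A minimal sketch to check this, reusing the corpus from the answer; the name `row_norms` is introduced here for illustration only:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

# Same two steps as in the answer: raw counts, then tf-idf weighting
counts = CountVectorizer(min_df=1).fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)

# Euclidean (L2) length of each document's tf-idf vector
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(tfidf.shape)
print(row_norms)
```

Whether unit-length text rows are balanced well enough against 0/1 numerical columns depends on the data, so treat this as a starting point, not a guarantee.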
With this representation, you should be able to merge in the additional binary features and train an SVC.
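As a rough sketch of that last step, the numerical columns can be stacked next to the tf-idf columns and passed to an `SVC`. The class labels in `labels` are made up for illustration (the question does not give any), and the linear kernel and `scipy.sparse.hstack` are choices made here, not something the original answer prescribes:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

numerical_features = np.array([
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1],
])
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
labels = [0, 1, 1, 0]  # hypothetical labels, not from the original post

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step
tfidf = TfidfVectorizer().fit_transform(corpus)

# Keep everything sparse: 2 numerical columns followed by 9 tf-idf columns
combined = hstack([csr_matrix(numerical_features), tfidf]).tocsr()
print(combined.shape)

# SVC accepts sparse input; the linear kernel is an assumption of this sketch
clf = SVC(kernel='linear')
clf.fit(combined, labels)
predictions = clf.predict(combined)
print(predictions)
```

Stacking sparse matrices avoids the `toarray()` call from the question, which matters once the vocabulary grows to thousands of words.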