How do I correctly combine numerical features with text (bag of words) in scikit-learn?

Question

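I am writing a classifier for web pages, so I have a mix of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up looking like this: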

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

numerical_features = [
  [1, 0],
  [1, 1],
  [0, 0],
  [0, 1]
]
corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]
bag_of_words_vectorizer = CountVectorizer(min_df=1)
X = bag_of_words_vectorizer.fit_transform(corpus)
words_counts = X.toarray()
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(words_counts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bag_of_words_vectorizer.get_feature_names_out()
combinedFeatures = np.hstack([numerical_features, tfidf.toarray()])

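This works, but I'm concerned about accuracy. Note that there are 4 objects and only two numerical features. Even the simplest text produces a vector with nine features (because there are nine distinct words in the corpus). Obviously, with real text there will be hundreds or thousands of distinct words, so the final feature vector would have fewer than 10 numerical features but on the order of 1000 word-based ones.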

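Because of this, won't the classifier (SVM) weight the words far more heavily than the numerical features, by a factor of about 100 to 1? If so, how can I compensate so that the bag of words is weighted equally against the numerical features?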

Answer 1

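I think your concern is entirely valid: encoding sparse text tokens the naive way (as multi-hot vectors) produces a significantly higher dimension than the numerical features have. You can tackle the problem with at least the two approaches below. Both of them produce a low-dimensional vector (for example, 100 dimensions) from the text, and the dimension does not grow as your vocabulary grows.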

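With feature hashing. This applies to your bag-of-words model.
With word embeddings (there are example usages that work with scikit-learn), or with more advanced text encoders such as the Universal Sentence Encoder or any variant of the state-of-the-art BERT encoder.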
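
As a rough sketch of the feature hashing option, the snippet below uses scikit-learn's HashingVectorizer on the corpus from the question. The width of 100 columns, the alternate_sign=False setting, and the variable names are illustrative assumptions rather than recommendations; a real vocabulary hashed into so few columns will produce collisions.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import HashingVectorizer

# Data copied from the question.
numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

# n_features fixes the output width up front, so it does not grow with the
# vocabulary. 100 is an illustrative choice, not a tuned value.
hasher = HashingVectorizer(n_features=100, alternate_sign=False)

# HashingVectorizer is stateless, so transform() can be called without fit().
hashed_text = hasher.transform(corpus)

# Keep everything sparse and put the numerical columns next to the text columns.
combined = hstack([csr_matrix(numerical_features), hashed_text]).tocsr()
print(combined.shape)  # (4, 102)
```

The text block stays at 100 columns no matter how many distinct words appear later, which is the property described above.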

Answer 2

You can weight the counts using Tf-idf:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

np.set_printoptions(linewidth=200)

corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()
print(words)
words_counts = X.toarray()
print(words_counts)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(words_counts)
print(tfidf.toarray())

The output looks like this:

# words
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# words_counts
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]

# tfidf transformation
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574  0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.          0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]]

With this representation you should be able to merge in the other binary features to train an SVC.
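
One way the merge could look is sketched below. The labels and the text_weight knob are hypothetical and exist only so that the example runs; they are not part of the question. The sketch also checks that, with the default norm='l2', each tf-idf row has unit length, so the text block as a whole keeps a bounded magnitude however many word columns it has.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Data copied from the question.
numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
labels = [0, 0, 1, 0]  # hypothetical labels, for illustration only

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
tfidf = TfidfVectorizer().fit_transform(corpus)

# With the default norm='l2', every row of the tf-idf block has length 1.
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(np.allclose(row_norms, 1.0))

# Hypothetical knob: scale the text block up or down relative to the
# numerical columns if one group should count for more.
text_weight = 1.0
combined = hstack([csr_matrix(numerical_features), text_weight * tfidf]).tocsr()

# SVC accepts sparse input directly.
clf = SVC(kernel='linear').fit(combined, labels)
print(combined.shape)  # (4, 11)
```

If the numerical features were not already 0/1 values, scaling them to a comparable range before stacking would probably be needed as well.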

Answer 1
1 2020-06-18 03:12:27

Answer 2
-2 2016-09-12 08:19:35
