
scikit-learn HashingVectorizer on sparse matrix

In scikit-learn, how can I run the HashingVectorizer on data already present in a scipy.sparse matrix?

My data is in svmlight format, so I am loading it with sklearn.datasets.load_svmlight_file and get a scipy.sparse matrix to work on.

The TfidfTransformer from scikit-learn can be fed such a sparse matrix to transform it, but how can I feed the same sparse matrix to the HashingVectorizer instead?

EDIT: Is there maybe a series of method calls that can be used on the sparse matrix, perhaps using the FeatureHasher?

EDIT 2: After a useful discussion with the user cfh below, my goal is to go from input (a sparse count matrix obtained from svmlight data) to output (a matrix of token occurrences, such as the HashingVectorizer produces). How could this be done?

I have provided sample code below and would really appreciate some help on how to do that. Thanks in advance:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import csr_matrix
import numpy as np

# example data
X_train = np.array([[1., 1.], [2., 3.], [4., 0.]])
print("X_train:\n", X_train)
# convert to scipy.sparse.csr_matrix to be consistent with the output of load_svmlight_file
X_train_csr = csr_matrix(X_train)
print("X_train_csr:\n", X_train_csr)
# no problem running TfidfTransformer() on this csr matrix to get a transformed csr matrix
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_train_csr)
print("tfidf:\n", tfidf)
# How do I use the HashingVectorizer with X_train_csr?
hv = HashingVectorizer(n_features=2)
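One point worth noting: HashingVectorizer operates on raw text documents, so it cannot consume a matrix directly. A possible workaround, sketched below under the assumption that the column indices themselves can serve as token names, is to use FeatureHasher (which accepts dicts of feature-name-to-value mappings) and convert each sparse row into such a dict. The helper `rows_as_dicts` is hypothetical, introduced only for this illustration:

```python
from sklearn.feature_extraction import FeatureHasher
from scipy.sparse import csr_matrix
import numpy as np

X_train = csr_matrix(np.array([[1., 1.], [2., 3.], [4., 0.]]))

def rows_as_dicts(X):
    # hypothetical helper: yield one {column-index-as-string: value} dict per row
    for i in range(X.shape[0]):
        row = X.getrow(i)
        yield {str(j): v for j, v in zip(row.indices, row.data)}

# n_features=8 is an arbitrary choice for the demo
hasher = FeatureHasher(n_features=8)
X_hashed = hasher.transform(rows_as_dicts(X_train))
print(X_hashed.shape)  # (3, 8)
```

Note that FeatureHasher uses a signed hash by default, so values that collide in a bucket can partially cancel rather than simply add up.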

Hashing basically combines words randomly into a smaller number of buckets. With an already computed frequency matrix, you can emulate this like so:

n_features = X_train.shape[1]
n_desired_features = n_features // 5  # integer division
# np.random.random_integers is deprecated; randint's upper bound is exclusive
buckets = np.random.randint(0, n_desired_features, size=n_features)
X_new = np.zeros((X_train.shape[0], n_desired_features), dtype=X_train.dtype)
for i in range(n_features):
    X_new[:, buckets[i]] += X_train[:, i]
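As a quick sanity check of the loop above, here is a toy sketch with made-up data and a fixed bucket assignment (so the result is reproducible). Bucketing only merges columns, so each row's total count is preserved:

```python
import numpy as np

# toy data: 2 samples, 4 features, collapsed into 2 buckets
X_train = np.array([[1., 1., 2., 0.],
                    [0., 3., 1., 1.]])
n_features = X_train.shape[1]
n_desired_features = 2
buckets = np.array([0, 1, 0, 1])  # fixed assignment instead of random, for the demo

X_new = np.zeros((X_train.shape[0], n_desired_features), dtype=X_train.dtype)
for i in range(n_features):
    X_new[:, buckets[i]] += X_train[:, i]

print(X_new)
# bucket 0 sums columns 0 and 2, bucket 1 sums columns 1 and 3:
# [[3. 1.]
#  [1. 4.]]

# merging columns never changes each row's total count
assert np.allclose(X_new.sum(axis=1), X_train.sum(axis=1))
```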

Of course you can adjust n_desired_features as you wish. Just make sure to use the same buckets for the test data as well.

If you need to do the same for a sparse matrix, you can do this:

from scipy.sparse import coo_matrix

M = coo_matrix((np.ones(n_features), (np.arange(n_features), buckets)),
               shape=(n_features, n_desired_features))
X_new = X_train.dot(M)
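The sparse formulation can be checked against the column-merging done by the dense loop. The sketch below uses the same hypothetical toy data and fixed bucket assignment as before and verifies that the matrix product gives identical results:

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

X_train = csr_matrix(np.array([[1., 1., 2., 0.],
                               [0., 3., 1., 1.]]))
n_features = X_train.shape[1]
n_desired_features = 2
buckets = np.array([0, 1, 0, 1])  # fixed assignment for the demo

# sparse indicator matrix M with M[i, buckets[i]] = 1; multiplying by M
# sums all original columns that share a bucket
M = coo_matrix((np.ones(n_features),
                (np.arange(n_features), buckets)),
               shape=(n_features, n_desired_features))
X_new = X_train.dot(M)

# same result as summing the columns of each bucket by hand
dense = X_train.toarray()
expected = np.column_stack([dense[:, buckets == b].sum(axis=1)
                            for b in range(n_desired_features)])
assert np.allclose(X_new.toarray(), expected)
```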
