
scikit-learn HashingVectorizer on sparse matrix

In scikit-learn, how can I run the HashingVectorizer on data already present in a scipy.sparse matrix?

My data is in svmlight format, so I am loading it with sklearn.datasets.load_svmlight_file and get a scipy.sparse matrix to work on.

The TfidfTransformer from scikit-learn can be fed such a sparse matrix to transform it, but how can I feed the same sparse matrix to the HashingVectorizer instead?

EDIT: Is there maybe a series of method calls that can be used on the sparse matrix, perhaps using the FeatureHasher?

EDIT 2: After a useful discussion with the user cfh below, my goal is to go from input (a sparse count matrix obtained from svmlight data) to output (a matrix of token occurrences, such as the HashingVectorizer produces). How could this be done?

I have provided sample code below and would really appreciate some help on how to do that. Thanks in advance:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import csr_matrix
import numpy as np

# example data
X_train = np.array([[1., 1.], [2., 3.], [4., 0.]])
print("X_train:\n", X_train)
# convert to scipy.sparse.csr_matrix to be consistent with the output of load_svmlight_file
X_train_csr = csr_matrix(X_train)
print("X_train_csr:\n", X_train_csr)
# no problem running TfidfTransformer() on this csr matrix to get a transformed csr matrix
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_train_csr)
print("tfidf:\n", tfidf)
# How do I use the HashingVectorizer with X_train_csr?
hv = HashingVectorizer(n_features=2)
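One point worth noting: HashingVectorizer operates on raw text documents, so it cannot consume a matrix directly. A possible workaround, sketched below under the assumption that the column indices themselves can serve as token names, is to use FeatureHasher (which accepts dicts of feature-name-to-value mappings) and convert each sparse row into such a dict. The helper `rows_as_dicts` is hypothetical, introduced only for this illustration:

```python
from sklearn.feature_extraction import FeatureHasher
from scipy.sparse import csr_matrix
import numpy as np

X_train = csr_matrix(np.array([[1., 1.], [2., 3.], [4., 0.]]))

def rows_as_dicts(X):
    # hypothetical helper: yield one {column-index-as-string: value} dict per row
    for i in range(X.shape[0]):
        row = X.getrow(i)
        yield {str(j): v for j, v in zip(row.indices, row.data)}

# n_features=8 is an arbitrary choice for the demo
hasher = FeatureHasher(n_features=8)
X_hashed = hasher.transform(rows_as_dicts(X_train))
print(X_hashed.shape)  # (3, 8)
```

Note that FeatureHasher uses a signed hash by default, so values that collide in a bucket can partially cancel rather than simply add up.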

Hashing basically combines words randomly into a smaller number of buckets. With an already computed frequency matrix, you can emulate this like so:

n_features = X_train.shape[1]
n_desired_features = n_features // 5  # integer division
# np.random.random_integers is deprecated; randint's upper bound is exclusive
buckets = np.random.randint(0, n_desired_features, size=n_features)
X_new = np.zeros((X_train.shape[0], n_desired_features), dtype=X_train.dtype)
for i in range(n_features):
    X_new[:, buckets[i]] += X_train[:, i]
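As a quick sanity check of the loop above, here is a toy sketch with made-up data and a fixed bucket assignment (so the result is reproducible). Bucketing only merges columns, so each row's total count is preserved:

```python
import numpy as np

# toy data: 2 samples, 4 features, collapsed into 2 buckets
X_train = np.array([[1., 1., 2., 0.],
                    [0., 3., 1., 1.]])
n_features = X_train.shape[1]
n_desired_features = 2
buckets = np.array([0, 1, 0, 1])  # fixed assignment instead of random, for the demo

X_new = np.zeros((X_train.shape[0], n_desired_features), dtype=X_train.dtype)
for i in range(n_features):
    X_new[:, buckets[i]] += X_train[:, i]

print(X_new)
# bucket 0 sums columns 0 and 2, bucket 1 sums columns 1 and 3:
# [[3. 1.]
#  [1. 4.]]

# merging columns never changes each row's total count
assert np.allclose(X_new.sum(axis=1), X_train.sum(axis=1))
```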

Of course you can adjust n_desired_features as you wish. Just make sure to use the same buckets for the test data as well.

If you need to do the same for a sparse matrix, you can do this:

from scipy.sparse import coo_matrix

M = coo_matrix((np.ones(n_features), (np.arange(n_features), buckets)),
               shape=(n_features, n_desired_features))
X_new = X_train.dot(M)
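The sparse formulation can be checked against the column-merging done by the dense loop. The sketch below uses the same hypothetical toy data and fixed bucket assignment as before and verifies that the matrix product gives identical results:

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

X_train = csr_matrix(np.array([[1., 1., 2., 0.],
                               [0., 3., 1., 1.]]))
n_features = X_train.shape[1]
n_desired_features = 2
buckets = np.array([0, 1, 0, 1])  # fixed assignment for the demo

# sparse indicator matrix M with M[i, buckets[i]] = 1; multiplying by M
# sums all original columns that share a bucket
M = coo_matrix((np.ones(n_features),
                (np.arange(n_features), buckets)),
               shape=(n_features, n_desired_features))
X_new = X_train.dot(M)

# same result as summing the columns of each bucket by hand
dense = X_train.toarray()
expected = np.column_stack([dense[:, buckets == b].sum(axis=1)
                            for b in range(n_desired_features)])
assert np.allclose(X_new.toarray(), expected)
```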
