
scikit-learn - Tfidf on HashingVectorizer

I am using scikit-learn to perform some analytics on a large dataset (roughly 34,000 files). Now I was wondering: the HashingVectorizer aims at low memory usage. Is it possible to first convert a bunch of files to hashed feature vectors (saving them with pickle.dump), then load all these files together and convert them to TfIdf features? These features can be calculated from the HashingVectorizer output, because the counts are stored and the number of documents can be deduced. I now have the following:

import pickle
from sklearn.feature_extraction.text import HashingVectorizer

for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
    with open(path, 'wb') as handle:  # a different path per text
        pickle.dump(features, handle)

Then, loading the files is trivial:

from sklearn.feature_extraction.text import TfidfVectorizer

data = []
for path in paths:
    with open(path, 'rb') as handle:
        data.append(pickle.load(handle))
tfidf = TfidfVectorizer()
tfidf.fit_transform(data)

But the magic does not happen. How can I let the magic happen?

It seems the problem is that you are trying to vectorize your text twice. Once you have built a matrix of counts, you should be able to transform the counts to tf-idf features using sklearn.feature_extraction.text.TfidfTransformer instead of TfidfVectorizer.
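A minimal sketch of that step (assuming counts holds the sparse count matrix produced by a HashingVectorizer):

from sklearn.feature_extraction.text import TfidfTransformer

# `counts` is the sparse term-count matrix; the transformer reweights it by idf
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)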

Also, it appears your saved data is a sparse matrix. You should be stacking the loaded matrices using scipy.sparse.vstack() instead of passing a list of matrices to TfidfTransformer, for example:
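A sketch of the loading step along those lines (reusing paths from the question):

import pickle
import scipy.sparse
from sklearn.feature_extraction.text import TfidfTransformer

# load the per-file matrices pickled earlier (each one is a 1 x 2**20 sparse row)
matrices = []
for path in paths:
    with open(path, 'rb') as handle:
        matrices.append(pickle.load(handle))

# stack them vertically into one n_documents x 2**20 count matrix
counts = scipy.sparse.vstack(matrices)
tfidf = TfidfTransformer().fit_transform(counts)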

I'm quite worried by your loop:

for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])

Each time you re-fit your vectoriser, it may forget its vocabulary, so the entries in each vector won't correspond to the same words (I'm not sure about this; I guess it depends on how they do the hashing). Why not just fit it on the whole corpus, i.e.

    features = vectorizer.fit_transform(texts)
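Combined with the TfidfTransformer suggestion above, a sketch of the whole pipeline might look like this (note that non_negative=True matches the older scikit-learn API used in the question; newer releases dropped it in favour of alternate_sign=False):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# fit once over the whole corpus so every document shares the same feature space
vectorizer = HashingVectorizer(norm=None, non_negative=True)
counts = vectorizer.fit_transform(texts)
tfidf = TfidfTransformer().fit_transform(counts)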

For your actual question, it sounds like you are just trying to normalise the columns of your data matrix by the IDF; you should be able to do this directly on the arrays (I've converted to numpy arrays since I can't work out how the indexing works on the scipy arrays). The mask DF != 0 is necessary since you used the hashing vectoriser, which has 2^20 columns:

import numpy as np

# dense copy of the hashed counts (2**20 columns, so this can be memory-hungry)
X = np.array(features.todense())
# document frequency: number of documents in which each column is non-zero
DF = (X != 0).sum(axis=0)
# keep only the columns some document actually uses, then divide by DF
X_TFIDF = X[:, DF != 0] / DF[DF != 0]
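Converting to dense at 2**20 columns can be expensive; as an alternative (my own sketch, not part of the original answer), the same column scaling can be done while staying sparse, e.g. with scipy.sparse.diags:

import numpy as np
import scipy.sparse

# document frequency per hashed column, flattened to a plain 1-D array
DF = np.asarray((features != 0).sum(axis=0)).ravel()
keep = DF != 0  # columns used by at least one document

# right-multiplying by a diagonal matrix divides each kept column by its DF
inv_df = scipy.sparse.diags(1.0 / DF[keep])
X_TFIDF = features[:, keep].dot(inv_df)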
