
scipy/sklearn sparse matrix decomposition for document classification

I'm trying to do document classification on a large corpus (4 million documents) and keep running into memory errors when using the standard scikit-learn methods. After cleaning/stemming my data, I have a very sparse matrix with about 1 million features (words). My first thought was to use sklearn.decomposition.TruncatedSVD, but I can't perform the .fit() operation with a large enough k due to memory errors (the largest k I can use only accounts for 25% of the variance in the data). I tried following the scikit-learn classification example here, but still run out of memory when doing the KNN classification. I'd like to do an out-of-core matrix transformation manually, applying PCA/SVD to the matrix to reduce its dimensionality, but I first need a way to calculate the eigenvectors. I was hoping to use scipy.sparse.linalg.eigs. Is there a method to calculate the eigenvector matrix needed to complete the code shown below?

from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp
import numpy as np
import cPickle as pkl
from sklearn.neighbors import KNeighborsClassifier

#generator that yields successive objects from a multi-object pickle file until EOF
def pickleLoader(pklFile):
    try:
        while True:
            yield pkl.load(pklFile)
    except EOFError:
        pass

#sample docs
docs = ['orange green','purple green','green chair apple fruit','raspberry pie banana yellow','green raspberry hat ball','test row green apple']
classes = [1,0,1,0,0,1]
#first k eigenvectors to keep
k = 3

#returns sparse matrix
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(docs)

#write sparse matrix to file, one row per pickled object, so it can be streamed back
with open('pickleTest.p', 'wb') as f:
    for row in tfs:
        pkl.dump(row, f)



#NEEDED - THIS LINE THAT CALCULATES top k eigenvectors
del tfs

x = np.empty([len(docs), k])

#iterate over the pickled rows and project each onto the top k eigenvectors
with open('D:\\GitHub\\Avitro-Classification\\pickleTest.p', 'rb') as f:
    rowCounter = 0
    for dataRow in pickleLoader(f):
        for col in range(k):
            x[rowCounter, col] = np.sum(dataRow * eigenvectors[:, col])
        rowCounter += 1

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(x, classes)
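For reference, here's a sketch of what I imagine the missing step could look like, assuming scipy.sparse.linalg.svds is the right tool here (it computes the top-k singular triplets of a sparse matrix without densifying it; the right singular vectors play the role of the eigenvectors of X^T X, which matches PCA only up to mean-centering):

from scipy.sparse.linalg import svds

#top-k singular triplets of the sparse TF-IDF matrix
#(svds does not guarantee any particular ordering of the singular values)
u, s, vt = svds(tfs, k=k)
#columns are the k component directions to project each row onto
eigenvectors = vt.T

This would stand in for the NEEDED line above, so that each streamed row can be projected with dataRow * eigenvectors[:, col].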

Any assistance or guidance would be much appreciated! If there's a better way to do this, I'm happy to try a different approach, but I'd like to try KNN on this large sparse dataset, preferably with some dimensionality reduction first (KNN performed very well on the small test dataset I ran, and I'd hate to lose that performance to memory constraints!)

Edit: Here's the code I first tried to run, which led me down the path of writing my own out-of-core sparse PCA implementation. Any help fixing this memory error would make things much easier!

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np
import pickle

dataFolder = 'D:\\GitHub\\project\\'

# in the form of a list: [word sample test word, big sample test word test, green apple test word]
descWords = pickle.load(open(dataFolder +'descriptionWords.p'))

vectorizer = TfidfVectorizer()
X_words = vectorizer.fit_transform(descWords)

print np.shape(X_words)

del descWords
del vectorizer

svd = TruncatedSVD(algorithm='randomized', n_components=50000, random_state=42)
output = svd.fit_transform(X_words)

which produces the following output:

(3995803, 923633)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-27-c0db86bd3830> in <module>()
     16 
     17 svd = TruncatedSVD(algorithm='randomized', n_components=50000, random_state=42)
---> 18 output = svd.fit_transform(X_words)

C:\Python27\lib\site-packages\sklearn\decomposition\truncated_svd.pyc in fit_transform(self, X, y)
    173             U, Sigma, VT = randomized_svd(X, self.n_components,
    174                                           n_iter=self.n_iter,
--> 175                                           random_state=random_state)
    176         else:
    177             raise ValueError("unknown algorithm %r" % self.algorithm)

C:\Python27\lib\site-packages\sklearn\utils\extmath.pyc in randomized_svd(M, n_components, n_oversamples, n_iter, transpose, flip_sign, random_state, n_iterations)
    297         M = M.T
    298 
--> 299     Q = randomized_range_finder(M, n_random, n_iter, random_state)
    300 
    301     # project M to the (k + p) dimensional space using the basis vectors

C:\Python27\lib\site-packages\sklearn\utils\extmath.pyc in randomized_range_finder(A, size, n_iter, random_state)
    212 
    213     # generating random gaussian vectors r with shape: (A.shape[1], size)
--> 214     R = random_state.normal(size=(A.shape[1], size))
    215 
    216     # sampling the range of A using by linear projection of r

C:\Python27\lib\site-packages\numpy\random\mtrand.pyd in mtrand.RandomState.normal (numpy\random\mtrand\mtrand.c:9968)()

C:\Python27\lib\site-packages\numpy\random\mtrand.pyd in mtrand.cont2_array_sc (numpy\random\mtrand\mtrand.c:2370)()

MemoryError: 

Out-of-core SVD or PCA on sparse data is not implemented in scikit-learn 0.15.2. You might want to try gensim instead. (As the traceback shows, the randomized solver tries to allocate a dense Gaussian matrix of shape (n_features, n_components + oversamples), which with ~920k features and 50,000 components would be hundreds of gigabytes.)

Edit: I forgot to specify "on sparse data" in my first reply.
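A minimal sketch with gensim's LsiModel, which computes a truncated SVD over a corpus streamed from disk in chunks (assuming one tokenized document per line in a text file; docs.txt and DiskCorpus are hypothetical names for illustration):

from gensim import corpora, models

#build the word <-> id mapping in a single streaming pass over the file
dictionary = corpora.Dictionary(line.split() for line in open('docs.txt'))

class DiskCorpus(object):
    #stream bag-of-words vectors from disk one document at a time
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary
    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.split())

corpus = DiskCorpus('docs.txt', dictionary)

#out-of-core truncated SVD: the corpus is processed chunksize documents at a time
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=300, chunksize=20000)

#lazily project documents into the reduced space for a downstream classifier
topic_vectors = lsi[corpus]

The reduced vectors could then be fed to KNeighborsClassifier as in your code above.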
