
How to load numpy sparse array containing TFIDF from scikit into Kmeans

I have the following code. I have computed TF-IDF weights using the scikit-learn vectorizer:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(min_df=1,stop_words="english")
term_freq_matrix = count_vectorizer.fit_transform(vectoriser.mydoclist)
# print "Vocabulary:", count_vectorizer.vocabulary_

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print(len(count_vectorizer.get_feature_names_out()))  # get_feature_names() on scikit-learn < 1.0
# for term in count_vectorizer.get_feature_names(): 
    # [k for k in count_vectorizer.get_feature_names() if '#' in k]:
    # if '#' in term:
    # print term.encode('utf-8')
# print np.matrix(tf_idf_matrix.todense())
# np.savetxt("foo.csv", (np.matrix(tf_idf_matrix.todense())), delimiter=",")
# np.savetxt("foo.csv", tf_idf_matrix.toarray(),fmt="%.4e")

# with open("foo-c.csv","w") as outputFile:

# np.save("pick",np.array(tf_idf_matrix))
# np.savez("pick",data = tf_idf_matrix.data ,indices=tf_idf_matrix.indices,indptr =tf_idf_matrix.indptr, shape=tf_idf_matrix.shape )

I want to store the matrix in a way that is fast and efficient. I have tried CSV, but the data comes to 20 GB. I can't load the CSR matrix from the pick file: the K-means algorithm gives me an error saying the sequence provided.

Can anyone help me load this CSR matrix into K-means in scikit-learn?

Assuming the tf_idf_matrix is a scipy.sparse csr_matrix , the savez line looks fine. But you'll have to reconstruct the matrix from those saved arrays (with sparse or tf code).
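As a rough sketch of that reconstruction step: the three arrays plus the shape are exactly what the csr_matrix constructor accepts, and scikit-learn's KMeans takes sparse input directly. The random matrix below is only a hypothetical stand-in for the real tf_idf_matrix, and the temp-file path and cluster count are made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans

# Hypothetical stand-in for the real tf_idf_matrix (any CSR matrix works).
tf_idf_matrix = sparse.random(60, 25, density=0.2, format="csr", random_state=0)

# Save the pieces of the CSR matrix, as in the question's savez line.
path = os.path.join(tempfile.mkdtemp(), "pick.npz")
np.savez(path, data=tf_idf_matrix.data, indices=tf_idf_matrix.indices,
         indptr=tf_idf_matrix.indptr, shape=tf_idf_matrix.shape)

# Rebuild the sparse matrix from the saved arrays.
with np.load(path) as saved:
    restored = sparse.csr_matrix(
        (saved["data"], saved["indices"], saved["indptr"]),
        shape=tuple(saved["shape"]),
    )

# KMeans accepts the sparse matrix as-is; no need to densify it.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(restored)
```

The point is to pass the rebuilt csr_matrix to KMeans rather than whatever np.load returns.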

np.save has no native support for sparse matrices, so the np.save approach won't work as written.

In [172]: np.array(As)
Out[172]: 
array(<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 1000 stored elements in Compressed Sparse Row format>, dtype=object)

With an array like this np.save tries to use pickle for the object.
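To see what is going on, here is a small check, assuming As is a 1000x1000 CSR identity matrix like the one shown above. Wrapping it with np.array does not produce a numeric 2-D array but a zero-dimensional object array holding the sparse matrix, which is not something KMeans can consume.

```python
import numpy as np
from scipy import sparse

# Assumed stand-in for As: 1000x1000 CSR matrix with 1000 stored elements.
As = sparse.identity(1000, format="csr")

wrapped = np.array(As)
print(wrapped.shape, wrapped.dtype)  # zero-dimensional, dtype=object

# The original sparse matrix is still inside the wrapper.
inner = wrapped.item()
print(type(inner), inner.shape)
```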

Does sklearn provide any sparse save methods? scipy has a couple of save methods that can handle sparse matrices. The one I've used is the MATLAB-compatible loadmat/savemat.
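A minimal sketch of the savemat/loadmat round trip, using a small random matrix as a hypothetical stand-in for tf_idf_matrix; the file names and the "tfidf" key are made up. It also shows scipy.sparse.save_npz/load_npz, which is not part of the answer above but is available in newer scipy versions and keeps the matrix in its sparse format.

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

# Hypothetical stand-in for the real tf_idf_matrix.
tf_idf_matrix = sparse.random(40, 15, density=0.3, format="csr", random_state=1)
tmpdir = tempfile.mkdtemp()

# MATLAB-compatible route: the matrix is stored under a dictionary key.
mat_path = os.path.join(tmpdir, "tfidf.mat")
io.savemat(mat_path, {"tfidf": tf_idf_matrix})
# Convert explicitly, since the loaded matrix may not come back in CSR format.
from_mat = sparse.csr_matrix(io.loadmat(mat_path)["tfidf"])

# scipy's own sparse format (assumes a scipy version that has save_npz).
npz_path = os.path.join(tmpdir, "tfidf.npz")
sparse.save_npz(npz_path, tf_idf_matrix)
from_npz = sparse.load_npz(npz_path)
```

Either from_mat or from_npz can then be passed to KMeans directly.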

