
How to efficiently calculate huge matrix multiplication (tfidf features) in Python?

I currently want to calculate all-pairs document similarity using cosine similarity and tf-idf features in Python. My basic approach is the following:

from sklearn.feature_extraction.text import TfidfVectorizer
# c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer()
X = vec.fit_transform(c)
del vec
# rows of X are L2-normalised by default, so X * X.T is the cosine similarity matrix
Y = X * X.T

This works perfectly fine, but unfortunately not for my very large datasets. X has shape (350363, 2526183), so the output matrix Y should have shape (350363, 350363). X is very sparse thanks to the tf-idf features and therefore fits easily into memory (around 2GB only). Yet the multiplication gives me a memory error after running for some time (even though memory is not yet full; I suppose scipy is clever enough to anticipate how much memory the result will need).

I have already tried playing around with the dtypes, without success. I have also made sure that numpy and scipy have their BLAS libraries linked, although this has no effect on the csr_matrix dot functionality, since it is implemented in C. I thought of maybe using things like memmap, but I am not sure about that.

Does anyone have an idea of how to best approach this?

Even though X is sparse, X * X.T probably won't be; notice that it only takes one common nonzero element in a given pair of rows to make the corresponding output entry nonzero. You are working on an NLP task, so I am pretty sure there are huge numbers of words which occur in nearly all documents (and, as said before, it does not have to be one word shared by all pairs, just one, possibly different, word for each pair). As a result you get a matrix of 350363^2, about 122,000,000,000, elements; storing that densely would take roughly 500GB even at single precision, so it does not look computable. Try to perform much more aggressive filtering of the words in order to force X * X.T to be sparse (remove many common words).
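A minimal sketch of what that more aggressive filtering could look like with TfidfVectorizer; the max_df / min_df values and the stop word list are illustrative choices, not recommendations:

from sklearn.feature_extraction.text import TfidfVectorizer
# c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer(
    max_df=0.5,            # drop terms that occur in more than half of the documents
    min_df=5,              # drop terms that occur in fewer than 5 documents
    stop_words='english',  # drop common English function words
)
X = vec.fit_transform(c)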

In general you won't be able to compute the Gram matrix of big data unless you enforce sparsity of X * X.T, so that most of your pairs of vectors (documents) have 0 "similarity". This can be done in numerous ways; the easiest is to set some threshold T under which you treat <a,b> as 0, compute the dot products yourself, and create an entry in the resulting sparse matrix iff <a,b> > T.
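A rough sketch of that idea, computing the product in row blocks and keeping only similarities above an assumed threshold T (both T and the block size are placeholders you would tune):

import scipy.sparse as sp

def thresholded_similarity(X, T=0.2, block_size=1000):
    # X is assumed to be the row-normalised CSR matrix from TfidfVectorizer
    # (rows are L2-normalised by default), so the dot products are cosine similarities
    blocks = []
    for start in range(0, X.shape[0], block_size):
        block = X[start:start + block_size] * X.T   # sparse block of pairwise similarities
        block.data[block.data <= T] = 0             # zero out weak similarities...
        block.eliminate_zeros()                     # ...and drop them from storage
        blocks.append(block)
    return sp.vstack(blocks).tocsr()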

You may want to look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eps, which is a hyperparameter you choose when calculating the number of random projections needed.

To cut a long story short, the scikit-learn class SparseRandomProjection (seen here) is a transformer that does this for you. If you run it on X after vec.fit_transform, you should see a fairly large reduction in feature size.

The formula from sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve pairwise distances up to a 10% tolerance, you only need johnson_lindenstrauss_min_dim(350363, .1) = 10942 features. This is an upper bound, so you may be able to get away with far fewer. Even a 1% tolerance would only need johnson_lindenstrauss_min_dim(350363, .01) = 1028192 features, which is still significantly less than you have right now.
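Roughly what that looks like in code; eps=0.1 and random_state=42 are arbitrary choices for the example:

from sklearn.random_projection import SparseRandomProjection, johnson_lindenstrauss_min_dim

# number of components needed to preserve pairwise distances within ~10%
n_components = johnson_lindenstrauss_min_dim(n_samples=350363, eps=0.1)

srp = SparseRandomProjection(n_components=n_components, random_state=42)
X_small = srp.fit_transform(X)   # roughly (350363, 10942) instead of (350363, 2526183)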

EDIT: A simple thing to try: if your data is dtype='float64', try using 'float32' instead. That alone can save a massive amount of space, especially if you do not need double precision.
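For example (assuming X and c come from the question's code):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# build the matrix in single precision from the start...
vec = TfidfVectorizer(dtype=np.float32)
X = vec.fit_transform(c)

# ...or convert an existing float64 matrix
X = X.astype(np.float32)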

If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an HDFStore (as seen in pandas using PyTables). This link has some good starter code, and you could iteratively calculate chunks of your dot product and write them to disk. I have been using this extensively in a recent project on a 45GB dataset, and could provide more help if you decide to go down this route.
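A sketch of that approach with pandas' HDFStore; the chunk size, the file name and the decision to densify each block before writing are assumptions for illustration:

import pandas as pd

# X is the sparse tf-idf matrix from the question
chunk = 500
with pd.HDFStore('similarity.h5', mode='w') as store:
    for start in range(0, X.shape[0], chunk):
        block = (X[start:start + chunk] * X.T).toarray()   # dense block of similarities
        df = pd.DataFrame(block, index=range(start, start + block.shape[0]))
        store.put(f'block_{start}', df)   # default 'fixed' format copes with wide frames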

What you could do is slice a row and a column of X, multiply those and save the resulting row to a file. Then move to the next row and column.

It is still the same amount of calculation work but you wouldn't run out of memory.
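A bare-bones sketch of that loop, with one output file per row so the full 350363 x 350363 matrix never has to live in memory (the file naming is just an example; in practice you would batch many rows per file):

import numpy as np

for i in range(X.shape[0]):
    row = (X[i] * X.T).toarray().ravel()   # similarities of document i to all documents
    np.save(f'row_{i:06d}.npy', row)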

Using multiprocessing.Pool.map() or multiprocessing.Pool.map_async() you might be able to speed it up, provided you use numpy.memmap() to read the matrix in the mapped function. You would probably also have to write each of the calculated rows to a separate file and merge them later; if you were to return the rows from the mapped function, they would have to be transferred back to the original process, which would cost a lot of memory and IPC bandwidth.
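A sketch of how that might be wired up: the CSR arrays are written to disk once, each worker memory-maps them, rebuilds the matrix on top of the memory maps, computes a block of rows and saves its own output file instead of returning data over IPC. All file names, the block size and the pool size are assumptions:

import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool

PREFIX = 'tfidf'   # file name prefix -- purely illustrative
BLOCK = 1000       # rows per task -- purely illustrative

def dump_csr(X, prefix=PREFIX):
    # write the three CSR arrays to disk so the workers can memory-map them
    np.save(f'{prefix}_data.npy', X.data)
    np.save(f'{prefix}_indices.npy', X.indices)
    np.save(f'{prefix}_indptr.npy', X.indptr)
    np.save(f'{prefix}_shape.npy', np.array(X.shape))

def load_csr_mmap(prefix=PREFIX):
    # rebuild the CSR matrix from memory-mapped arrays
    data = np.load(f'{prefix}_data.npy', mmap_mode='r')
    indices = np.load(f'{prefix}_indices.npy', mmap_mode='r')
    indptr = np.load(f'{prefix}_indptr.npy', mmap_mode='r')
    shape = tuple(np.load(f'{prefix}_shape.npy'))
    return sp.csr_matrix((data, indices, indptr), shape=shape)

def compute_block(start):
    X = load_csr_mmap()
    block = X[start:start + BLOCK] * X.T
    sp.save_npz(f'block_{start:07d}.npz', block)   # write the result instead of returning it
    return start

if __name__ == '__main__':
    # X = vec.fit_transform(c)   -- the tf-idf matrix from the question
    dump_csr(X)
    with Pool(processes=4) as pool:
        pool.map(compute_block, range(0, X.shape[0], BLOCK))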
