
Does using sparse matrices make an algorithm slower or faster in Sklearn?

I have large but sparse training data that I would like to use with ExtraTreeClassifier. Considering computational time, I am not sure whether I should pass a sparse csr_matrix or the raw dense data. Which version of the data runs faster with that classifier, and can the answer be generalized to all sparse-capable models?

If your data are sparse, the extra tree classifier will be faster with a csc_matrix. If in doubt, I would suggest benchmarking with both versions.

All algorithms should benefit from using the appropriate sparse format if your data are sufficiently sparse. For instance, algorithms based on dot products will be a lot faster with sparse data.
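For reference, a minimal benchmark sketch along those lines, assuming the ensemble ExtraTreesClassifier is meant (the single-tree sklearn.tree.ExtraTreeClassifier accepts the same input formats); the data shape, sparsity level and forest size are arbitrary placeholders, so timings will vary:

```
# Rough benchmark sketch: fit the same extra-trees model on dense data
# and on a CSC copy of it, and compare wall-clock fit times.
import time

import numpy as np
from scipy import sparse
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)

# Synthetic data that is roughly 95% zeros (arbitrary choice).
X_dense = rng.random((20_000, 200))
X_dense[X_dense < 0.95] = 0.0
y = rng.integers(0, 2, size=X_dense.shape[0])

X_csc = sparse.csc_matrix(X_dense)   # fit() converts sparse input to CSC
X_csr = sparse.csr_matrix(X_dense)   # predict() converts sparse input to CSR

for name, X in [("dense", X_dense), ("csc", X_csc)]:
    clf = ExtraTreesClassifier(n_estimators=50, n_jobs=-1, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"{name}: fit took {time.perf_counter() - start:.2f}s")
```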

Depends on your data

Memory consumption.

If your data is dense, a dense representation needs d * sizeof(double) bytes for your data (i.e. usually d * 8 bytes). A sparse representation usually needs about sparsity * d * (sizeof(int) + sizeof(double)) bytes, where sparsity is the fraction of non-zero values. Depending on your programming language and code quality, it can also be much more due to memory-management overhead. A typical Java implementation adds 8 bytes of object overhead and rounds object sizes up to a multiple of 8 bytes, so sparse vectors may easily use 16 + sparsity * d * 24 bytes.

If your sparsity is 1 (i.e. no zeros at all), this means a sparse representation needs 50% more memory than the dense one. I guess the memory trade-off in practice will be somewhere around 50% sparsity, and if your implementation isn't carefully optimized, maybe even 30% - so roughly 2 out of 3 values should be zero before sparse storage pays off.
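To make that arithmetic concrete, here is a small sketch using SciPy's CSR layout (a value array plus two index arrays); the vector length and the 30% non-zero fraction are arbitrary choices:

```
# Sketch: compare the memory footprint of a dense vector with the same
# data stored in SciPy's CSR format, at a chosen non-zero fraction.
import numpy as np
from scipy import sparse

d = 1_000_000            # number of values
nonzero_fraction = 0.3   # "sparsity" in the sense used above

rng = np.random.default_rng(0)
x = np.zeros(d)
x[rng.choice(d, int(nonzero_fraction * d), replace=False)] = 1.0

dense_bytes = x.nbytes   # d * sizeof(double) = d * 8
s = sparse.csr_matrix(x)
# value array + column indices + row pointer
sparse_bytes = s.data.nbytes + s.indices.nbytes + s.indptr.nbytes

print(f"dense: {dense_bytes / 1e6:.1f} MB, csr: {sparse_bytes / 1e6:.1f} MB")
```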

Memory consumption is usually a key problem. The more memory you use, the more page faults and cache misses your CPU will have, which can have a big impact on performance (which is why, e.g., BLAS libraries perform large matrix multiplications in block sizes optimized for your CPU caches).

Optimizations and SIMD.

Dense vector code (e.g. BLAS) is usually much better optimized than sparse operations. In particular, SIMD (single instruction, multiple data) CPU instructions usually only work with dense data.
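As a rough illustration (not a rigorous benchmark), one can time a BLAS-backed dense matrix-vector product against the sparse CSR equivalent at two densities; the sizes and densities below are arbitrary, and the crossover point depends on your hardware and libraries:

```
# Sketch: dense (BLAS/SIMD-friendly) vs. sparse CSR matrix-vector product
# at two different densities. Sizes and densities are arbitrary choices.
import time

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 3_000

def time_matvec(density):
    A_sparse = sparse.random(n, n, density=density, format="csr", random_state=0)
    A_dense = A_sparse.toarray()
    v = rng.random(n)

    t0 = time.perf_counter()
    for _ in range(20):
        A_dense @ v                # dense product, dispatched to BLAS
    dense_t = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(20):
        A_sparse @ v               # CSR product, indirect indexing
    sparse_t = time.perf_counter() - t0

    print(f"density={density:.2f}: dense {dense_t:.3f}s, sparse {sparse_t:.3f}s")

time_matvec(0.01)   # very sparse: CSR usually wins
time_matvec(0.50)   # fairly dense: the dense product usually wins
```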

Random access.

Many algorithms may need random access to vectors. If your data is represented as a double[] array, random access is O(1). If your data is a sparse vector, random access usually is O(sparsity * d), i.e. you will have to scan the vector to check whether a value is present. It may thus be beneficial to transpose the matrix for some operations and work with sparse columns instead of sparse rows.
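A small sketch of the difference in access patterns, using NumPy and SciPy; the exact timings are not the point, only the cost model:

```
# Sketch: random element access. Dense indexing is O(1); looking a value
# up in a CSR row means searching that row's stored indices. Column
# operations are cheap on CSC but expensive on CSR, which is why
# transposing (or converting) can pay off for column-oriented work.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = rng.random((1_000, 1_000))
X[X < 0.99] = 0.0                 # ~1% non-zeros

X_csr = sparse.csr_matrix(X)
X_csc = sparse.csc_matrix(X)

i, j = 123, 456
print(X[i, j])                    # O(1): offset arithmetic into a flat array
print(X_csr[i, j])                # searches the stored indices of row i

col_from_csc = X_csc.getcol(j)    # cheap: CSC stores columns contiguously
col_from_csr = X_csr.getcol(j)    # expensive: touches every row
```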

On the other hand, some algorithms may benefit from exactly this kind of access pattern. But many implementations have such optimizations built in and will take care of this automatically. Sometimes you also have different choices available. For example, APRIORI works on rows and thus will work well with row-sparse data. Eclat, on the other hand, is an algorithm for the same problem, but it first transforms all data into a column-sparse (vertical) form, and then even computes column differences to further optimize.
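Not a scikit-learn example, but to illustrate the representation switch described above, here is a toy sketch of Eclat's vertical layout, where each item maps to the set of transaction IDs containing it; support counting becomes a set intersection, and the dEclat variant works with set differences (diffsets) instead:

```
# Toy sketch of Eclat's vertical ("column-sparse") representation:
# each item is stored with the IDs of the transactions that contain it.
transactions = {
    0: {"bread", "milk"},
    1: {"bread", "butter"},
    2: {"milk", "butter"},
    3: {"bread", "milk", "butter"},
}

# Build TID lists: item -> set of transaction IDs.
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support of an itemset = size of the intersection of its TID lists.
print(len(tidlists["bread"] & tidlists["milk"]))   # 2 (transactions 0 and 3)

# dEclat-style diffset: transactions containing "bread" but not "milk".
print(tidlists["bread"] - tidlists["milk"])        # {1}
```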

Code complexity.

Code to process sparse data usually is much more complex. In particular, it cannot make use of SSE and similar fast CPU instructions easily. It is one of the reasons why sparse matrix multiplications are much slower than dense operations - optimizing these operations without knowing certain characteristics of your data is surprisingly hard. :-(
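To see why the sparse code path resists SSE-style vectorization, here is a hand-written CSR matrix-vector product (a sketch of what the underlying C code does, not something you would run for speed in Python); the data-dependent indirect index x[indices[k]] is the part that is hard to vectorize:

```
# Hand-written CSR mat-vec (y = A @ x), spelled out to show the indirect
# indexing that makes SIMD vectorization difficult.
import numpy as np
from scipy import sparse

def csr_matvec(A, x):
    data, indices, indptr = A.data, A.indices, A.indptr
    y = np.zeros(A.shape[0])
    for row in range(A.shape[0]):
        # Each row touches a different, data-dependent set of columns.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

A = sparse.random(100, 100, density=0.05, format="csr", random_state=0)
x = np.arange(100, dtype=float)
assert np.allclose(csr_matvec(A, x), A @ x)
```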
