
Does using sparse matrices make an algorithm slower or faster in Sklearn?

I have large but sparse training data that I would like to use with ExtraTreeClassifier. Considering computational time, I am not sure whether I should pass a sparse csr_matrix or the raw dense data. Which version of the data runs faster with that classifier, and can the answer be generalized to all sparse-capable models?

If your data are sparse, the extra tree classifier will be faster with a csc_matrix. If in doubt, I would suggest benchmarking with both versions.

All algorithms should benefit from using the appropriate sparse format if your data are sufficiently sparse. For instance, algorithms based on dot products will be a lot faster with sparse data.
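For reference, a minimal benchmark sketch along those lines, assuming the ensemble ExtraTreesClassifier is meant (the single-tree sklearn.tree.ExtraTreeClassifier accepts the same input formats); the data shape, sparsity level and forest size are arbitrary placeholders, so timings will vary:

```
# Rough benchmark sketch: fit the same extra-trees model on dense data
# and on a CSC copy of it, and compare wall-clock fit times.
import time

import numpy as np
from scipy import sparse
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)

# Synthetic data that is roughly 95% zeros (arbitrary choice).
X_dense = rng.random((20_000, 200))
X_dense[X_dense < 0.95] = 0.0
y = rng.integers(0, 2, size=X_dense.shape[0])

X_csc = sparse.csc_matrix(X_dense)   # fit() converts sparse input to CSC
X_csr = sparse.csr_matrix(X_dense)   # predict() converts sparse input to CSR

for name, X in [("dense", X_dense), ("csc", X_csc)]:
    clf = ExtraTreesClassifier(n_estimators=50, n_jobs=-1, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"{name}: fit took {time.perf_counter() - start:.2f}s")
```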

Depends on your data

Memory consumption.

If your data is dense, a dense representation needs d * sizeof(double) bytes for your data (i.e. usually d * 8 bytes). A sparse representation usually needs about sparsity * d * (sizeof(int) + sizeof(double)) bytes, where sparsity is the fraction of non-zero values. Depending on your programming language and code quality, it can also be much more due to memory-management overhead. A typical Java implementation adds 8 bytes of object overhead and rounds object sizes up to a multiple of 8 bytes, so sparse vectors may easily use 16 + sparsity * d * 24 bytes.

If your sparsity is 1 (i.e. no zeros at all), this means a sparse representation needs 50% more memory than the dense one. I guess the memory trade-off in practice will be somewhere around 50% sparsity, and if your implementation isn't carefully optimized, maybe even 30% - so roughly 2 out of 3 values should be zero before sparse storage pays off.
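To make that arithmetic concrete, here is a small sketch using SciPy's CSR layout (a value array plus two index arrays); the vector length and the 30% non-zero fraction are arbitrary choices:

```
# Sketch: compare the memory footprint of a dense vector with the same
# data stored in SciPy's CSR format, at a chosen non-zero fraction.
import numpy as np
from scipy import sparse

d = 1_000_000            # number of values
nonzero_fraction = 0.3   # "sparsity" in the sense used above

rng = np.random.default_rng(0)
x = np.zeros(d)
x[rng.choice(d, int(nonzero_fraction * d), replace=False)] = 1.0

dense_bytes = x.nbytes   # d * sizeof(double) = d * 8
s = sparse.csr_matrix(x)
# value array + column indices + row pointer
sparse_bytes = s.data.nbytes + s.indices.nbytes + s.indptr.nbytes

print(f"dense: {dense_bytes / 1e6:.1f} MB, csr: {sparse_bytes / 1e6:.1f} MB")
```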

Memory consumption is usually a key problem. The more memory you use, the more page faults and cache misses your CPU will have, which can have a big impact on performance (which is why, e.g., BLAS libraries perform large matrix multiplications in block sizes optimized for your CPU caches).

Optimizations and SIMD.

Dense vector code (e.g. BLAS) is usually much better optimized than sparse operations. In particular, SIMD (single instruction, multiple data) CPU instructions usually only work with dense data.
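As a rough illustration (not a rigorous benchmark), one can time a BLAS-backed dense matrix-vector product against the sparse CSR equivalent at two densities; the sizes and densities below are arbitrary, and the crossover point depends on your hardware and libraries:

```
# Sketch: dense (BLAS/SIMD-friendly) vs. sparse CSR matrix-vector product
# at two different densities. Sizes and densities are arbitrary choices.
import time

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 3_000

def time_matvec(density):
    A_sparse = sparse.random(n, n, density=density, format="csr", random_state=0)
    A_dense = A_sparse.toarray()
    v = rng.random(n)

    t0 = time.perf_counter()
    for _ in range(20):
        A_dense @ v                # dense product, dispatched to BLAS
    dense_t = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(20):
        A_sparse @ v               # CSR product, indirect indexing
    sparse_t = time.perf_counter() - t0

    print(f"density={density:.2f}: dense {dense_t:.3f}s, sparse {sparse_t:.3f}s")

time_matvec(0.01)   # very sparse: CSR usually wins
time_matvec(0.50)   # fairly dense: the dense product usually wins
```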

Random access.

Many algorithms may need random access to vectors. If your data is represented as a double[] array, random access is O(1). If your data is a sparse vector, random access usually is O(sparsity * d), i.e. you will have to scan the vector to check whether a value is present. It may thus be beneficial to transpose the matrix for some operations and work with sparse columns instead of sparse rows.
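A small sketch of the difference in access patterns, using NumPy and SciPy; the exact timings are not the point, only the cost model:

```
# Sketch: random element access. Dense indexing is O(1); looking a value
# up in a CSR row means searching that row's stored indices. Column
# operations are cheap on CSC but expensive on CSR, which is why
# transposing (or converting) can pay off for column-oriented work.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = rng.random((1_000, 1_000))
X[X < 0.99] = 0.0                 # ~1% non-zeros

X_csr = sparse.csr_matrix(X)
X_csc = sparse.csc_matrix(X)

i, j = 123, 456
print(X[i, j])                    # O(1): offset arithmetic into a flat array
print(X_csr[i, j])                # searches the stored indices of row i

col_from_csc = X_csc.getcol(j)    # cheap: CSC stores columns contiguously
col_from_csr = X_csr.getcol(j)    # expensive: touches every row
```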

On the other hand, some algorithms may benefit from exactly this kind of access pattern. But many implementations have such optimizations built in and will take care of this automatically. Sometimes you also have different choices available. For example, APRIORI works on rows and thus will work well with row-sparse data. Eclat, on the other hand, is an algorithm for the same problem, but it first transforms all data into a column-sparse (vertical) form, and then even computes column differences to further optimize.
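Not a scikit-learn example, but to illustrate the representation switch described above, here is a toy sketch of Eclat's vertical layout, where each item maps to the set of transaction IDs containing it; support counting becomes a set intersection, and the dEclat variant works with set differences (diffsets) instead:

```
# Toy sketch of Eclat's vertical ("column-sparse") representation:
# each item is stored with the IDs of the transactions that contain it.
transactions = {
    0: {"bread", "milk"},
    1: {"bread", "butter"},
    2: {"milk", "butter"},
    3: {"bread", "milk", "butter"},
}

# Build TID lists: item -> set of transaction IDs.
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support of an itemset = size of the intersection of its TID lists.
print(len(tidlists["bread"] & tidlists["milk"]))   # 2 (transactions 0 and 3)

# dEclat-style diffset: transactions containing "bread" but not "milk".
print(tidlists["bread"] - tidlists["milk"])        # {1}
```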

Code complexity.

Code to process sparse data usually is much more complex. In particular, it cannot make use of SSE and similar fast CPU instructions easily. It is one of the reasons why sparse matrix multiplications are much slower than dense operations - optimizing these operations without knowing certain characteristics of your data is surprisingly hard. :-(
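To see why the sparse code path resists SSE-style vectorization, here is a hand-written CSR matrix-vector product (a sketch of what the underlying C code does, not something you would run for speed in Python); the data-dependent indirect index x[indices[k]] is the part that is hard to vectorize:

```
# Hand-written CSR mat-vec (y = A @ x), spelled out to show the indirect
# indexing that makes SIMD vectorization difficult.
import numpy as np
from scipy import sparse

def csr_matvec(A, x):
    data, indices, indptr = A.data, A.indices, A.indptr
    y = np.zeros(A.shape[0])
    for row in range(A.shape[0]):
        # Each row touches a different, data-dependent set of columns.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

A = sparse.random(100, 100, density=0.05, format="csr", random_state=0)
x = np.arange(100, dtype=float)
assert np.allclose(csr_matvec(A, x), A @ x)
```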
