
Sparse vs. Dense Vectors PySpark

How can I know whether I should use a sparse or dense representation in PySpark? I understand the difference between them (sparse saves memory by only storing the non-zero indices and values), but performance-wise, are there any general heuristics that describe when to use sparse vectors over dense ones?

Is there a general "cutoff" dimension and percentage of zero values beyond which it is generally better to use sparse vectors? If not, how should I go about making the decision? Thanks.
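
For concreteness, here is a minimal example of the two representations I mean (arbitrary values, assuming only that pyspark is installed locally):

    # The same 8-dimensional vector with only two non-zero entries.
    from pyspark.ml.linalg import Vectors

    dense = Vectors.dense([0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 5.0, 0.0])
    sparse = Vectors.sparse(8, [2, 6], [3.0, 5.0])  # size, non-zero indices, values

    print(dense)             # [0.0,0.0,3.0,0.0,0.0,0.0,5.0,0.0]
    print(sparse)            # (8,[2,6],[3.0,5.0])
    print(sparse.toArray())  # back to a plain NumPy array with all 8 entries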

The thing to remember is that pyspark.ml.linalg.Vector and pyspark.mllib.linalg.Vector are just compatibility layers between Python and the Java API. They are not full-featured or optimized linear algebra utilities and you shouldn't use them as such. The available operations are either not designed for performance or just convert to a standard NumPy array under the covers.
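
For example, a minimal sketch (assuming a local PySpark installation) showing that even basic operations just go through NumPy:

    from pyspark.ml.linalg import Vectors

    v = Vectors.dense([1.0, 2.0, 3.0])
    w = Vectors.sparse(3, [1], [4.0])

    # dot() and toArray() simply fall back to NumPy under the covers
    print(v.dot(w))           # 8.0
    print(type(v.toArray()))  # <class 'numpy.ndarray'>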

When used with other ml / mllib tools they will be serialized and converted to Java equivalents, so the performance of the Python representation is mostly inconsequential.

This means that the biggest real concern is storage, and a simple rule of thumb (sketched in code after the list below) is:

  • If, on average, at least half of the entries are zero, it is better to use SparseVector.
  • Otherwise it is better to use DenseVector.
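
A short sketch of that rule of thumb; to_preferred is a hypothetical helper, not part of the PySpark API:

    from pyspark.ml.linalg import Vectors

    def to_preferred(values):
        """Use SparseVector when at least half of the entries are zero,
        otherwise DenseVector (a storage-only heuristic)."""
        nonzero = {i: v for i, v in enumerate(values) if v != 0.0}
        if len(nonzero) <= len(values) / 2:
            return Vectors.sparse(len(values), nonzero)
        return Vectors.dense(values)

    print(to_preferred([0.0, 0.0, 0.0, 7.0]))   # (4,[3],[7.0])
    print(to_preferred([1.0, 2.0, 0.0, 7.0]))   # [1.0,2.0,0.0,7.0]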
