
SparseVector to DenseVector conversion in Pyspark

Unexpected errors when converting a SparseVector to a DenseVector in PySpark 1.4.1:

from pyspark.mllib.linalg import SparseVector, DenseVector

DenseVector(SparseVector(5, {4: 1.}))

This runs properly in pyspark on Ubuntu, returning:

DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])

This results in an error in pyspark on RedHat:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 206, in __init__
    ar = np.array(ar, dtype=np.float64)
  File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 673, in __getitem__
    raise ValueError("Index %d out of bounds." % index)
ValueError: Index 5 out of bounds.


Also, on both platforms, evaluating the following results in an error:

DenseVector(SparseVector(5, {0: 1.}))

I would expect:

DenseVector([1.0, 0.0, 0.0, 0.0, 0.0])

but get:

  • Ubuntu:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 206, in __init__
    ar = np.array(ar, dtype=np.float64)
  File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 676, in __getitem__
    row_ind = inds[insert_index]
IndexError: index out of bounds

Note: this error message is different from the previous one, although the error occurs in the same function (code at https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg.html).

  • RedHat: the same command results in a Segmentation Fault, which crashes Spark.

Spark 2.0.2+

You should be able to iterate over SparseVectors. See SPARK-17587.
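So on those versions a direct conversion via iteration should work. A sketch based on the linked ticket (not verified here; the expected output is an assumption):

from pyspark.mllib.linalg import SparseVector, DenseVector

sv = SparseVector(5, {4: 1.0})
# With the fixed __getitem__ contract, iteration terminates correctly,
# so building a dense vector from the elements should work:
DenseVector([x for x in sv])  # expected: DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])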

Spark < 2.0.2

Well, the first case is quite interesting, but the overall behavior doesn't look like a bug at all. If you take a look at the DenseVector constructor, it considers only two cases (see the sketch after this list):

  1. ar is a bytes object (an immutable sequence of integers in the range 0 <= x < 256)
  2. Otherwise we simply call np.array(ar, dtype=np.float64)
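For reference, a minimal paraphrase of those two cases (based on the PySpark 1.4.x linalg.py referenced in the tracebacks above; the class name is mine, and this is a simplification, not the verbatim source):

import numpy as np

class DenseVectorSketch(object):
    def __init__(self, ar):
        if isinstance(ar, bytes):
            # case 1: raw bytes are reinterpreted as float64 values
            ar = np.frombuffer(ar, dtype=np.float64)
        else:
            # case 2: anything else is handed straight to NumPy
            ar = np.array(ar, dtype=np.float64)
        self.array = ar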

SparseVector is clearly not a bytes object, so when you pass it to the constructor it is used as the object parameter for the np.array call. If you check the numpy.array docs, you learn that object should be:

An array, any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence.

You can check that SparseVector doesn't meet the above criteria. It is not a Python sequence type, and:

>>> import numpy as np
>>> from pyspark.mllib.linalg import SparseVector
>>> sv = SparseVector(5, {4: 1.})
>>> isinstance(sv, np.ndarray)
False
>>> hasattr(sv, "__array_interface__")
False
>>> hasattr(sv, "__array__")
False
>>> hasattr(sv, "__iter__")
False

If you want to convert a SparseVector to a DenseVector, you should probably use the toArray method:

DenseVector(sv.toArray())
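Put together with the imports, the full conversion looks like this (sv as defined above):

from pyspark.mllib.linalg import SparseVector, DenseVector

sv = SparseVector(5, {4: 1.})
dv = DenseVector(sv.toArray())  # toArray() returns a numpy.ndarray
# dv: DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])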

Edit:

I think this behavior explains why DenseVector(SparseVector(...)) may work in some cases:

>>> [x for x in SparseVector(5, {0: 1.})]
[1.0]
>>> [x for x in SparseVector(5, {4: 1.})]
Traceback (most recent call last):
...
ValueError: Index 5 out of bounds.
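My reading of why this happens (an interpretation, not spelled out above): since SparseVector has no __iter__, Python falls back to the legacy iteration protocol, which calls __getitem__(0), __getitem__(1), ... and stops only on IndexError. The 1.4.x __getitem__ raised IndexError (from the inds[insert_index] lookup) once the index passed the last stored entry, silently truncating the iteration, but raised ValueError past the declared size, which the protocol does not treat as end-of-sequence. A toy class reproducing this failure mode (hypothetical, not PySpark's code):

class BuggySparse(object):
    """Toy reproduction of the 1.4.x failure mode; not PySpark's code."""
    def __init__(self, size, nonzero):
        self.size = size
        self.nonzero = dict(nonzero)
        self.max_stored = max(self.nonzero)

    def __getitem__(self, i):
        if i >= self.size:
            # past the declared size: ValueError propagates out of the loop
            raise ValueError("Index %d out of bounds." % i)
        if i > self.max_stored:
            # past the last stored index: IndexError silently ends iteration
            raise IndexError("index out of bounds")
        return self.nonzero.get(i, 0.0)

print([x for x in BuggySparse(5, {0: 1.0})])  # [1.0]  (truncated early)
# [x for x in BuggySparse(5, {4: 1.0})] raises ValueError: Index 5 out of bounds.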
