SparseVector to DenseVector conversion in PySpark
Unexpected errors when converting a SparseVector to a DenseVector in PySpark 1.4.1:
from pyspark.mllib.linalg import SparseVector, DenseVector
DenseVector(SparseVector(5, {4: 1.}))
This runs properly in pyspark on Ubuntu, returning:
DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])
This results in an error in pyspark on RedHat, returning:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 206, in __init__
    ar = np.array(ar, dtype=np.float64)
  File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 673, in __getitem__
    raise ValueError("Index %d out of bounds." % index)
ValueError: Index 5 out of bounds.
Also, on both platforms, evaluating the following results in an error:
DenseVector(SparseVector(5, {0: 1.}))
I would expect:
DenseVector([1.0, 0.0, 0.0, 0.0, 0.0])
but get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 206, in __init__
    ar = np.array(ar, dtype=np.float64)
  File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 676, in __getitem__
    row_ind = inds[insert_index]
IndexError: index out of bounds
Note: this error message is different from the previous one, although the error occurs in the same function (code at https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg.html ).
Spark 2.0.2+
You should be able to iterate SparseVectors. See: SPARK-17587.
Spark < 2.0.2
Well, the first case is quite interesting, but the overall behavior doesn't look like a bug at all. If you take a look at the DenseVector constructor, it considers only two cases:

1. ar is a bytes object (an immutable sequence of integers in the range 0 <= x < 256)
2. otherwise it falls through to np.array(ar, dtype=np.float64)
A SparseVector is clearly not a bytes object, so when you pass it to the constructor it is used as the object parameter for the np.array call. If you check the numpy.array docs, you learn that object should be:

An array, any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence.
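As an illustration of the inputs np.array accepts, here is a small hypothetical class (not part of pyspark; all names here are made up) satisfying the second criterion by exposing an __array__ method:

```python
import numpy as np

# Hypothetical example class (not from pyspark): it meets one of the
# criteria above by providing an __array__ method returning an ndarray.
class ArrayLike:
    def __array__(self, dtype=None, copy=None):
        return np.array([1.0, 2.0, 3.0], dtype=dtype)

# np.array detects __array__ and uses its return value
print(np.array(ArrayLike(), dtype=np.float64))  # [1. 2. 3.]
```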
You can check that SparseVector doesn't meet any of the above criteria. It is not a Python sequence type, and:
>>> sv = SparseVector(5, {4: 1.})
>>> isinstance(sv, np.ndarray)
False
>>> hasattr(sv, "__array_interface__")
False
>>> hasattr(sv, "__array__")
False
>>> hasattr(sv, "__iter__")
False
If you want to convert a SparseVector to a DenseVector, you should probably use the toArray method:
DenseVector(sv.toArray())
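For intuition, the expansion that toArray performs can be sketched in pure Python (a hypothetical helper, not the pyspark implementation, which returns a NumPy float64 array): start from a zero vector and fill in only the stored entries.

```python
# Hypothetical pure-Python sketch of sparse-to-dense expansion; the real
# SparseVector.toArray returns a numpy.float64 array instead of a list.
def sparse_to_dense(size, entries):
    dense = [0.0] * size              # dense vector of zeros
    for idx, val in entries.items():
        dense[idx] = val              # overwrite only the stored positions
    return dense

print(sparse_to_dense(5, {4: 1.0}))   # [0.0, 0.0, 0.0, 0.0, 1.0]
```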
Edit:
I think this behavior explains why DenseVector(SparseVector(...)) may work in some cases:
>>> [x for x in SparseVector(5, {0: 1.})]
[1.0]
>>> [x for x in SparseVector(5, {4: 1.})]
Traceback (most recent call last):
...
ValueError: Index 5 out of bounds.
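The difference comes down to Python's legacy iteration protocol: when a class defines __getitem__ but not __iter__, iteration calls __getitem__(0), __getitem__(1), ... and stops cleanly only on IndexError; any other exception, such as the ValueError above, propagates. A minimal stand-in class (hypothetical, not the real SparseVector) shows both outcomes:

```python
# Hypothetical class mimicking the relevant trait of SparseVector in
# Spark < 2.0.2: it has __getitem__ but no __iter__, so iteration falls
# back to the legacy protocol that probes successive indices.
class Legacy:
    def __init__(self, stop_exc):
        self.stop_exc = stop_exc      # exception raised past the last index
    def __getitem__(self, i):
        if i >= 2:
            raise self.stop_exc("Index %d out of bounds." % i)
        return float(i)

print(list(Legacy(IndexError)))       # [0.0, 1.0] - IndexError ends iteration
try:
    list(Legacy(ValueError))          # ValueError is not swallowed...
except ValueError as e:
    print(e)                          # ...it propagates, like the {4: 1.} case
```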