PySpark: convert RDD[DenseVector] to dataframe
I have the following RDD:
rdd.take(5) gives me:
[DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]),
DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699])]
I would like to make it a DataFrame which should look like:
-------------------------------------------------------------------
| features |
-------------------------------------------------------------------
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0] |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
Is this possible? I tried to use
df_new = sqlContext.createDataFrame(rdd, ['features'])
but it didn't work. Does anyone have any suggestions? Thanks!
Map to tuples first:
rdd.map(lambda x: (x, )).toDF(["features"])
Just keep in mind that as of Spark 2.0 there are two different Vector implementations, and ml algorithms require pyspark.ml.Vector.