PySpark DataFrame to array RDD for KMeans

Question

I am trying to run the KMeans clustering algorithm in Spark 2.2, but I cannot find the correct input format. It fails with TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector. On checking further, I found that my inputrdd is an RDD of Row objects. Can we convert it to an array RDD? The MLlib documentation shows that a parallelized RDD of arrays can be passed to the KMeans model. The error occurs at the KMeans.train step.

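import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel

# Build a small Spark DataFrame from pandas (sqlContext is assumed to come from the pyspark shell)
df = pd.DataFrame({"c1": [1, 2, 3, 4, 5, 6], "c2": [2, 6, 1, 2, 4, 6], "c3": [21, 32, 12, 65, 43, 52]})
sdf = sqlContext.createDataFrame(df)

# sdf.rdd is an RDD of Row objects, not vectors
inputrdd = sdf.rdd

# This call raises the TypeError described above
model = KMeans.train(inputrdd, 2, maxIterations=10, initializationMode="random",
                     seed=50, initializationSteps=5, epsilon=1e-4)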

Output of inputrdd.collect():

[Row(c1=1, c2=2, c3=21),
 Row(c1=2, c2=6, c3=32),
 Row(c1=3, c2=1, c3=12),
 Row(c1=4, c2=2, c3=65),
 Row(c1=5, c2=4, c3=43),
 Row(c1=6, c2=6, c3=52)]
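
Editor's note on why the Row objects shown above are rejected. Row is a subclass of tuple, and plain tuples and lists are accepted as KMeans input, so the error message is consistent with a converter that checks the exact type of each element rather than using isinstance. The sketch below is plain Python and does not call Spark. FakeRow, ACCEPTED_TYPES and to_vector_strict are hypothetical names that imitate that behaviour for illustration only; they are not the MLlib source.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row: like Row, it subclasses tuple.
FakeRow = namedtuple("FakeRow", ["c1", "c2", "c3"])

# Simplified assumption: the real converter accepts more types than these two.
ACCEPTED_TYPES = (list, tuple)


def to_vector_strict(value):
    """Imitate an exact-type check, which rejects subclasses of tuple."""
    if type(value) in ACCEPTED_TYPES:
        return [float(x) for x in value]
    raise TypeError("Cannot convert type %s into Vector" % type(value))


row = FakeRow(c1=1, c2=2, c3=21)

# isinstance() sees a tuple, but the exact-type check does not.
try:
    to_vector_strict(row)
    strict_ok = True
except TypeError:
    strict_ok = False

# Converting to a plain tuple first satisfies the exact-type check.
converted = to_vector_strict(tuple(row))
```

If this reading is right, mapping each Row to a plain tuple, a list, or a vector before training should remove the error.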

Answer 1

The following change fixed it. I converted each Row of the RDD to a Vector directly using Vectors.dense.

from pyspark.mllib.linalg import Vectors

# Convert each Row into a DenseVector so that KMeans.train accepts the RDD
inputrdd = sdf.rdd.map(lambda s: Vectors.dense(s))

Answered: 2018-02-24 07:31:25