Spark DataFrame zipWithIndex
I am using a DataFrame to read in .parquet files, but then I turn them into an RDD to do the normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a DataFrame to an RDD:
:26: error: value zipWithIndex is not a member of org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do? Essentially, I am trying to get the value and the column index.
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but I am getting stuck on the last part, as I am not sure how to do the equivalent of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky.
Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
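Since the elements come back as Any, you typically pattern match (or cast) on each value to recover a concrete type. A minimal sketch of that idea, using a plain Seq[Any] to stand in for the untyped values Row.toSeq would give you (no Spark dependency, names are illustrative):

```scala
// Sketch: handling the (value, index) pairs produced by r.toSeq.zipWithIndex.
// A plain Seq[Any] stands in for Row.toSeq, which is also untyped.
object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    val row: Seq[Any] = Seq("alice", 42, 3.14)

    // Same shape as convRDD.flatMap(r => r.toSeq.zipWithIndex): (Any, Int) pairs.
    val indexed: Seq[(Any, Int)] = row.zipWithIndex

    // Pattern match per element to recover concrete types.
    indexed.foreach {
      case (s: String, i) => println(s"column $i: String = $s")
      case (n: Int, i)    => println(s"column $i: Int = $n")
      case (d: Double, i) => println(s"column $i: Double = $d")
      case (other, i)     => println(s"column $i: unrecognized = $other")
    }
  }
}
```

On a real RDD[Row] you would do the same match inside the flatMap, or use the typed getters on Row (getString(i), getInt(i), and so on) when the schema is known up front.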