
Spark DataFrame zipWithIndex


I am using a DataFrame to read in .parquet files, but then I turn them into an RDD to do the normal processing I want on them.

So I have my file:

val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd 
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)

I get the following error even when I convert from a DataFrame to an RDD:

:26: error: value zipWithIndex is not a member of org.apache.spark.sql.Row

Does anyone know how to do what I am trying to do, essentially getting each value together with its column index?

I was thinking of something like:

val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length)

but I am getting stuck on the last part, as I am not sure how to achieve the same thing as zipWithIndex.
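The index-driven idea sketched above can be completed by pairing each index with the value at that position. A minimal sketch on plain Scala collections, with a `Seq[Any]` standing in for an `org.apache.spark.sql.Row` (the row values here are hypothetical):

```scala
// A row's values, standing in for org.apache.spark.sql.Row (hypothetical data).
val row: Seq[Any] = Seq("a", 1, true)
val schemaLength = row.length

// For each column index, pair the value at that position with the index.
// This is equivalent to row.zipWithIndex, but driven by the schema length,
// mirroring the `0 until schema.length` approach in the question.
val columnIndex: Seq[(Any, Int)] = (0 until schemaLength).map(i => (row(i), i))
// columnIndex == Seq(("a", 0), (1, 1), (true, 2))
```

On a real `Row` the lookup would be `r.get(i)` rather than `row(i)`, but the pairing logic is the same.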

You can simply convert the Row to a Seq:

convRDD.flatMap(r => r.toSeq.zipWithIndex)

The important thing to note here is that extracting type information becomes tricky: Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
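The typing caveat can be seen without a Spark cluster. A small sketch using a `Seq[Any]` in place of what `Row.toSeq` returns (the row values and the `labels` helper are illustrative, not from the original post):

```scala
// Hypothetical row values, mimicking what Row.toSeq returns: a Seq[Any].
val rowValues: Seq[Any] = Seq("alice", 42, 3.14)

// Pair each value with its column index, as in the answer above.
// The element type is (Any, Int), so static type information is lost.
val indexed: Seq[(Any, Int)] = rowValues.zipWithIndex
// indexed == Seq(("alice", 0), (42, 1), (3.14, 2))

// Recovering concrete types then requires a cast or a pattern match,
// e.g. keeping only the String-valued columns:
val labels: Seq[String] = indexed.collect { case (s: String, i) => s"col $i: $s" }
// labels == Seq("col 0: alice")
```

Inside an RDD the same pattern applies: after `convRDD.flatMap(r => r.toSeq.zipWithIndex)` you would pattern-match or cast each `Any` back to its concrete type before doing typed work on it.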
