Spark DataFrame zipWithIndex
I am using a DataFrame to read in .parquet files, but then I turn them into an RDD to do the normal processing I wanted to do on them.
So I have my file:
val dataSplit = sqlContext.parquetFile("input.parquet")
val convRDD = dataSplit.rdd
val columnIndex = convRDD.flatMap(r => r.zipWithIndex)
I get the following error even when I convert from a DataFrame to an RDD:
:26: error: value zipWithIndex is not a member of org.apache.spark.sql.Row
Does anyone know how to do what I am trying to do? Essentially, I am trying to get the value and the column index.
I was thinking something like:
val dataSplit = sqlContext.parquetFile(inputVal.toString)
val schema = dataSplit.schema
val columnIndex = dataSplit.flatMap(r => 0 until schema.length
but I am getting stuck on the last part, as I am not sure how to do the equivalent of zipWithIndex.
You can simply convert Row to Seq:
convRDD.flatMap(r => r.toSeq.zipWithIndex)
An important thing to note here is that extracting type information becomes tricky.
Row.toSeq returns Seq[Any], so the resulting RDD is RDD[(Any, Int)].
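Since the elements come back as Any, you typically pattern match (or cast) on each value to recover a concrete type. A minimal sketch of that idea, using a plain Seq[Any] to stand in for the untyped values Row.toSeq would give you (no Spark dependency, names are illustrative):

```scala
// Sketch: handling the (value, index) pairs produced by r.toSeq.zipWithIndex.
// A plain Seq[Any] stands in for Row.toSeq, which is also untyped.
object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    val row: Seq[Any] = Seq("alice", 42, 3.14)

    // Same shape as convRDD.flatMap(r => r.toSeq.zipWithIndex): (Any, Int) pairs.
    val indexed: Seq[(Any, Int)] = row.zipWithIndex

    // Pattern match per element to recover concrete types.
    indexed.foreach {
      case (s: String, i) => println(s"column $i: String = $s")
      case (n: Int, i)    => println(s"column $i: Int = $n")
      case (d: Double, i) => println(s"column $i: Double = $d")
      case (other, i)     => println(s"column $i: unrecognized = $other")
    }
  }
}
```

On a real RDD[Row] you would do the same match inside the flatMap, or use the typed getters on Row (getString(i), getInt(i), and so on) when the schema is known up front.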