
Unable to convert an RDD with zipWithIndex to a DataFrame in Spark

I am unable to convert an RDD produced by zipWithIndex into a DataFrame.

I have read the contents from a file, and I need to skip the first 3 records and then limit them to the 10th row. To do this I used rdd.zipWithIndex.

But after that, when I try to save the 7 remaining records, I am unable to do so.

import org.apache.spark.sql.Row

val skipValue = 3
val limitValue = 10
val delimValue = ","

val df = spark.read.format("com.databricks.spark.csv")
                   .option("delimiter", delimValue)
                   .option("header", "false")
                   .load("/user/ashwin/data1/datafile.txt")

val df1 = df.rdd.zipWithIndex()
                .filter(x => { x._2 > skipValue && x._2 <= limitValue })
                .map(f => Row(f._1))

df1.foreach(f2 => print(f2.toString))
[[113,3Bapi,Ghosh,86589579]][[114,4Bapi,Ghosh,86589579]]
[[115,5Bapi,Ghosh,86589579]][[116,6Bapi,Ghosh,86589579]]
[[117,7Bapi,Ghosh,86589579]][[118,8Bapi,Ghosh,86589579]]
[[119,9Bapi,Ghosh,86589579]]



scala> val df = spark.read.format("com.databricks.spark.csv").option("delimiter", delimValue).option("header", "true").load("/user/bigframe/ashwin/data1/datafile.txt")
df: org.apache.spark.sql.DataFrame = [empid: string, fname: string ... 2 more fields]

scala> val df1 = df.rdd.zipWithIndex().filter(x => { x._2 > skipValue && x._2 <= limitValue;}).map(f => Row(f._1))
df1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[885] at map at <console>:38

scala> import spark.implicits._
import spark.implicits._

scala> df1.

    ++             count                 flatMap                 groupBy           mapPartitionsWithIndex   reduce             takeAsync         union
aggregate      countApprox           fold                    id                max                      repartition        takeOrdered       unpersist
cache          countApproxDistinct   foreach                 intersection      min                      sample             takeSample        zip
cartesian      countAsync            foreachAsync            isCheckpointed    name                     saveAsObjectFile   toDebugString     zipPartitions
checkpoint     countByValue          foreachPartition        isEmpty           partitioner              saveAsTextFile     toJavaRDD         zipWithIndex
coalesce       countByValueApprox    foreachPartitionAsync   iterator          partitions               setName            toLocalIterator   zipWithUniqueId
collect        dependencies          getCheckpointFile       keyBy             persist                  sortBy             toString
collectAsync   distinct              getNumPartitions        localCheckpoint   pipe                     sparkContext       top
compute        filter                getStorageLevel         map               preferredLocations       subtract           treeAggregate
context        first                 glom                    mapPartitions     randomSplit              take               treeReduce

scala> df1.toDF
<console>:44: error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
       df1.toDF
       ^

Once you convert the dataframe to an rdd you get an RDD[Row], so to convert it back to a dataframe you need to create it with sqlContext.createDataFrame().

Creating the dataframe also requires a schema; in this case you can reuse the schema generated earlier for df.

val df1 = df.rdd.zipWithIndex()
  .filter(x => { x._2 > 3 && x._2 <= 10 })
  .map(_._1)

val result = spark.sqlContext.createDataFrame(df1, df.schema)
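
As a quick check (a minimal sketch; the output directory below is only illustrative, it is not taken from the question), you can verify the filtered result and then persist the 7 remaining records:

// Verify the result: rows 4 through 10 of the original file
result.show()
result.count()   // expected: 7

// Persist the records, e.g. as CSV (output path is an example)
result.write
      .option("header", "true")
      .csv("/user/ashwin/data1/output")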

Hope this helps!

It is probably of type RDD[Row] at the moment. Have you tried using the toDF function? You also have to import spark.implicits._.
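
Note that toDF only becomes available when the RDD's element type has an Encoder, which RDD[Row] does not, and that is why the call in the transcript fails even after the import. A minimal sketch of one way to make toDF usable, assuming the four-column layout from the sample output above (only empid and fname are confirmed by the printed schema; the remaining column names here are placeholders):

import spark.implicits._

// RDD[Row] has no Encoder, so map each Row to a tuple of its columns first
val typed = df.rdd.zipWithIndex()
  .filter { case (_, idx) => idx > 3 && idx <= 10 }
  .map { case (row, _) =>
    (row.getString(0), row.getString(1), row.getString(2), row.getString(3))
  }

// Tuples of Strings have an Encoder, so toDF works here
val result2 = typed.toDF("empid", "fname", "col3", "col4")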

