Unable to convert an rdd with zipWithIndex to a dataframe in spark

I am unable to convert an RDD produced by zipWithIndex into a DataFrame.

I read from a file, and I need to skip the first 3 records and then limit the records to row number 10. For this, I used rdd.zipWithIndex.

But afterwards, when I try to save the 7 records, I am not able to do so.

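import org.apache.spark.sql.Row

// Parameters must be defined before they are used below
val skipValue = 3
val limitValue = 10
val delimValue = ","

val df = spark.read.format("com.databricks.spark.csv")
                   .option("delimiter", delimValue)
                   .option("header", "false")
                   .load("/user/bigframe/ashwin/data1/datafile.txt")

// Pair every row with its index, keep the wanted range, then wrap in a Row
val df1 = df.rdd.zipWithIndex()
                .filter(x => x._2 > skipValue && x._2 <= limitValue)
                .map(f => Row(f._1))

df1.foreach(f2 => print(f2.toString))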

scala> val df = spark.read.format("com.databricks.spark.csv").option("delimiter", delimValue).option("header", "true").load("/user/bigframe/ashwin/data1/datafile.txt")
df: org.apache.spark.sql.DataFrame = [empid: string, fname: string ... 2 more fields]

scala> val df1 = df.rdd.zipWithIndex().filter(x => { x._2 > skipValue && x._2 <= limitValue;}).map(f => Row(f._1))
df1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[885] at map at <console>:38

scala> import spark.implicits._
import spark.implicits._

scala> df1.toDF
<console>:44: error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]

Once you convert a DataFrame to an RDD you get an RDD[Row], and toDF is not available on an RDD[Row] because spark.implicits._ provides no implicit encoder for Row. To convert back to a DataFrame, you need to create it with sqlContext.createDataFrame().

A schema is also required to create the DataFrame. In this case you can reuse the schema that was already generated for df.

val df1 = df.rdd.zipWithIndex()
  .filter(x => x._2 > 3 && x._2 <= 10)
  .map(_._1) // drop the index so the element type is Row again, not (Row, Long)

val result = spark.sqlContext.createDataFrame(df1, df.schema)
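
For reference, here is a minimal, self-contained sketch of the same approach that can be pasted into spark-shell. It assumes a small in-memory DataFrame in place of the CSV file; the row values, the two-column schema, and the local session are made up for illustration. Because zipWithIndex starts counting at 0, the sketch filters on index >= skipValue and index < limitValue, which assumes the intent is to skip exactly the first three records and stop after the tenth. It also drops the index with map before calling createDataFrame, since that method expects an RDD[Row].

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for illustration only; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("zipWithIndexSketch").getOrCreate()

// Hypothetical stand-in for the CSV file: 12 rows with two string columns.
val schema = StructType(Seq(
  StructField("empid", StringType, nullable = true),
  StructField("fname", StringType, nullable = true)))
val rows = (1 to 12).map(i => Row(s"e$i", s"name$i"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows, 1), schema)

val skipValue = 3
val limitValue = 10

// Indexes are zero-based: keep indexes 3..9, i.e. records 4 through 10.
val kept = df.rdd.zipWithIndex()
  .filter { case (_, idx) => idx >= skipValue && idx < limitValue }
  .map { case (row, _) => row } // drop the index, leaving RDD[Row]

// Reuse the schema Spark already holds for the source DataFrame.
val result: DataFrame = spark.createDataFrame(kept, df.schema)
result.show()
```

With the original condition (index > 3 and index <= 10) the result also has seven rows, but they are records 5 through 11 rather than 4 through 10.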

Hope this helps!

This is probably of type RDD[Row] right now. Have you tried using the toDF function? You'll have to import spark.implicits._ as well.
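
As a hedged illustration of when toDF does apply: with spark.implicits._ in scope it is available on RDDs of tuples or case classes, so one option is to map each (Row, index) pair to a tuple first. The sketch below assumes two string columns and made-up sample data; the column name row_index is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("toDFSketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the real file.
val source = Seq(("e1", "Ann"), ("e2", "Bob"), ("e3", "Cid")).toDF("empid", "fname")

// Turning each Row into a tuple gives toDF an element type it can encode,
// and keeps the index as an ordinary column.
val withIndex = source.rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), row.getString(1), idx) }
  .toDF("empid", "fname", "row_index")

withIndex.show()
```

This requires knowing the column types up front, whereas createDataFrame with df.schema works for any schema.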

