Unable to convert an RDD with zipWithIndex to a DataFrame in Spark
I am unable to convert an RDD created with zipWithIndex into a DataFrame.
I have read from a file, and I need to skip the first 3 records and then limit the output to row number 10. For this, I used rdd.zipWithIndex.
But afterwards, when I try to save the 7 remaining records, I am not able to do so.
import org.apache.spark.sql.Row

val skipValue = 3
val limitValue = 10
val delimValue = ","

val df = spark.read.format("com.databricks.spark.csv")
  .option("delimiter", delimValue)
  .option("header", "false")
  .load("/user/ashwin/data1/datafile.txt")

val df1 = df.rdd.zipWithIndex()
  .filter(x => x._2 > skipValue && x._2 <= limitValue)
  .map(f => Row(f._1))

df1.foreach(f2 => print(f2.toString))
[[113,3Bapi,Ghosh,86589579]][[114,4Bapi,Ghosh,86589579]]
[[115,5Bapi,Ghosh,86589579]][[116,6Bapi,Ghosh,86589579]]
[[117,7Bapi,Ghosh,86589579]][[118,8Bapi,Ghosh,86589579]]
[[119,9Bapi,Ghosh,86589579]]
scala> val df = spark.read.format("com.databricks.spark.csv").option("delimiter", delimValue).option("header", "true").load("/user/bigframe/ashwin/data1/datafile.txt")
df: org.apache.spark.sql.DataFrame = [empid: string, fname: string ... 2 more fields]
scala> val df1 = df.rdd.zipWithIndex().filter(x => { x._2 > skipValue && x._2 <= limitValue;}).map(f => Row(f._1))
df1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[885] at map at <console>:38
scala> import spark.implicits._
import spark.implicits._
scala> df1.
++ count flatMap groupBy mapPartitionsWithIndex reduce takeAsync union
aggregate countApprox fold id max repartition takeOrdered unpersist
cache countApproxDistinct foreach intersection min sample takeSample zip
cartesian countAsync foreachAsync isCheckpointed name saveAsObjectFile toDebugString zipPartitions
checkpoint countByValue foreachPartition isEmpty partitioner saveAsTextFile toJavaRDD zipWithIndex
coalesce countByValueApprox foreachPartitionAsync iterator partitions setName toLocalIterator zipWithUniqueId
collect dependencies getCheckpointFile keyBy persist sortBy toString
collectAsync distinct getNumPartitions localCheckpoint pipe sparkContext top
compute filter getStorageLevel map preferredLocations subtract treeAggregate
context first glom mapPartitions randomSplit take treeReduce
scala> df1.toDF
<console>:44: error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
df1.toDF
^
You get an RDD[Row] once you convert the DataFrame to an RDD, so to convert back to a DataFrame you need to create one with sqlContext.createDataFrame().
A schema is also required to create the DataFrame; in this case you can reuse the schema that was already inferred for df.
val df1 = df.rdd.zipWithIndex()
  .filter(x => x._2 > 3 && x._2 <= 10)
  .map(_._1)
val result = spark.sqlContext.createDataFrame(df1, df.schema)
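Once result is a plain DataFrame again, it can be saved like any other DataFrame. A minimal sketch of writing the 7 filtered records back out; the output path below is an assumption, not from the question:

// Sketch only: the output directory is hypothetical.
result.write
  .option("delimiter", delimValue)
  .option("header", "false")
  .csv("/user/ashwin/data1/output")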
Hope this helps!
This is probably of type RDD[Row] right now. Have you tried using the toDF function? You'll have to import spark.implicits._ as well.
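As a side note on the toDF suggestion: toDF needs an implicit Encoder, which exists for tuples and case classes but not for Row, which is why it does not show up on df1 above. A minimal sketch of that route, assuming the file's four columns are read as strings; the last two column names are made up for illustration:

import spark.implicits._

// Sketch only: assumes four string columns; "lname" and "phone" are hypothetical names.
val byTuple = df.rdd.zipWithIndex()
  .filter { case (_, idx) => idx > skipValue && idx <= limitValue }
  .map { case (row, _) => (row.getString(0), row.getString(1), row.getString(2), row.getString(3)) }
  .toDF("empid", "fname", "lname", "phone")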