
Spark SQL data from org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]

I have an org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])].
How can I print or extract its data?

My code looks like this:

val sessionsDF = Seq(("day1","user1","session1", 100.0),
  ("day1","user1","session2",200.0),
  ("day2","user1","session3",300.0),
  ("day2","user1","session4",400.0),
  ("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal").toDF()

val groupByData=sessionsDF.groupBy(x=>(x.get(0),x.get(1)))
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)

The above code returns an org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])].

In your first step, you have an extra .toDF(). The correct version is:

val sessionsDF = Seq(("day1","user1","session1", 100.0),
  ("day1","user1","session2",200.0),
  ("day2","user1","session3",300.0),
  ("day2","user1","session4",400.0),
  ("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal")

In your second step, you missed .rdd, so the actual second step is

val groupByData=sessionsDF.rdd.groupBy(x=>(x.get(0),x.get(1)))

which has the dataType you mentioned in the question:

scala> groupByData
res12: org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] = ShuffledRDD[9] at groupBy at <console>:25

To view the groupByData RDD you can simply use foreach:

groupByData.foreach(println)

which would give you

((day1,user1),CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0]))
((day2,user1),CompactBuffer([day2,user1,session3,300.0], [day2,user1,session4,400.0], [day2,user1,session4,99.0]))
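Note that foreach on an RDD runs on the executors, so on a real cluster that output ends up in the executor logs rather than on your console (in local-mode spark-shell it happens to appear where you typed it). If you want the output on the driver and the data is small, a safer sketch is to collect first:

// collect pulls the grouped data to the driver, so the println output
// always shows up on your console (only do this for small data)
groupByData.collect.foreach(println)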

Your third step filters the grouped data, keeping only the groups whose day (the first element of the key) is day1, and then maps to just the values (the Iterable[Row]) of the grouped RDD:

val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)

The returned dataType for this step is

scala> filterData
res13: org.apache.spark.rdd.RDD[Iterable[org.apache.spark.sql.Row]] = MapPartitionsRDD[11] at map at <console>:27

You can use foreach as above to view the data:

filterData.foreach(println)

which would give you

CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0])

You can see that the returned dataType is an RDD[Iterable[org.apache.spark.sql.Row]], so you can print each value using a map:

filterData.map(x => x.map(y => println(y(0), y(1), y(2), y(3)))).collect

which would give you

(day1,user1,session1,100.0)
(day1,user1,session2,200.0)
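If you would rather not put a side-effecting println inside a nested map, one alternative sketch is to flatten the Iterables into an RDD[Row] and collect before printing:

// RDD[Iterable[Row]] -> RDD[Row], then print on the driver
filterData
  .flatMap(rows => rows)
  .collect
  .foreach(row => println((row.get(0), row.get(1), row.get(2), row.get(3))))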

If you do only

filterData.map(x => x.map(y => println(y(0), y(3)))).collect

you would get

(day1,100.0)
(day1,200.0)
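Finally, if the end goal is simply to look at the day1 rows, a sketch of staying entirely in the DataFrame API (no detour through the RDD) would be:

// filter on the day column and show the result as a table
sessionsDF.filter($"day" === "day1").show()

This keeps the column names and avoids working with untyped (Any, Any) keys.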

I hope the answer is helpful.
