Spark Sql data from org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]
I have org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] data.
How do I print or get the data? I have code like:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal").toDF()
val groupByData=sessionsDF.groupBy(x=>(x.get(0),x.get(1)))
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The above code is returning org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])]
In your first step, you have an extra .toDF(). The correct one is as below:
val sessionsDF = Seq(("day1","user1","session1", 100.0),
("day1","user1","session2",200.0),
("day2","user1","session3",300.0),
("day2","user1","session4",400.0),
("day2","user1","session4",99.0)
).toDF("day","userId","sessionId","purchaseTotal")
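As a side note, toDF on a local Seq only compiles with the SparkSession implicits in scope; a minimal setup sketch, assuming a local session (the app name and master here are illustrative placeholders):
import org.apache.spark.sql.SparkSession
// Build a local SparkSession for experimenting
val spark = SparkSession.builder().appName("sessions").master("local[*]").getOrCreate()
// Brings toDF (and the $ column syntax) into scope
import spark.implicits._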
In your second step, you missed .rdd, so the actual second step is
val groupByData=sessionsDF.rdd.groupBy(x=>(x.get(0),x.get(1)))
which has the dataType you mentioned in the question:
scala> groupByData
res12: org.apache.spark.rdd.RDD[((Any, Any), Iterable[org.apache.spark.sql.Row])] = ShuffledRDD[9] at groupBy at <console>:25
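The (Any, Any) in the key comes from Row.get, which returns Any. If you want typed keys instead, a sketch using the typed getters (assuming both columns are strings, as in the sample data):
// getString returns String, so the key type becomes (String, String)
val typedGroupByData = sessionsDF.rdd.groupBy(x => (x.getString(0), x.getString(1)))
// typedGroupByData: RDD[((String, String), Iterable[org.apache.spark.sql.Row])]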
To view the groupByData rdd you can simply use foreach as
groupByData.foreach(println)
which would give you
((day1,user1),CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0]))
((day2,user1),CompactBuffer([day2,user1,session3,300.0], [day2,user1,session4,400.0], [day2,user1,session4,99.0]))
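Note that rdd.foreach(println) runs on the executors, so on a cluster the output ends up in executor logs rather than the driver console; for a small dataset like this you can collect to the driver first:
// collect() pulls the grouped data to the driver before printing
groupByData.collect().foreach(println)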
Now your third step is filtering the data which has day1 as the value for the day column in your dataframe, and you are taking only the values of the grouped rdd data.
val filterData=groupByData.filter(x=>x._1._1=="day1").map(x=>x._2)
The returned dataType for this step is
scala> filterData
res13: org.apache.spark.rdd.RDD[Iterable[org.apache.spark.sql.Row]] = MapPartitionsRDD[11] at map at <console>:27
You can use foreach as above to view the data:
filterData.foreach(println)
which would give you
CompactBuffer([day1,user1,session1,100.0], [day1,user1,session2,200.0])
You can see that the returned dataType is an RDD[Iterable[org.apache.spark.sql.Row]], so you can print each value using a map as
filterData.map(x => x.map(y => println(y(0), y(1), y(2), y(3)))).collect
which would give you
(day1,user1,session1,100.0)
(day1,user1,session2,200.0)
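Instead of printing inside a transformation, you could also flatten the Iterable and inspect the rows on the driver; a sketch:
// flatMap(identity) turns RDD[Iterable[Row]] into RDD[Row]
val flatRows = filterData.flatMap(identity)
// collect the (small) result and print each Row's fields as a tuple
flatRows.collect().foreach(row => println((row(0), row(1), row(2), row(3))))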
If you do only
filterData.map(x => x.map(y => println(y(0), y(3)))).collect
you would get
(day1,100.0)
(day1,200.0)
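As a final note, if the goal is just the day1 rows, you could stay in the DataFrame API and skip the RDD round trip entirely; a sketch (the $ syntax assumes import spark.implicits._ as above):
// Filter directly on the day column; no RDD conversion needed
sessionsDF.filter($"day" === "day1").show()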
I hope the answer is helpful.