How can I loop through a Spark data frame
I have a data frame that consists of:
time, id, direction
10, 4, True //here 4 enters --> (4,)
20, 5, True //here 5 enters --> (4,5)
34, 5, False //here 5 leaves --> (4,)
67, 6, True //here 6 enters --> (4,6)
78, 6, False //here 6 leaves --> (4,)
99, 4, False //here 4 leaves --> ()
It is sorted by time, and now I would like to step through it and accumulate the valid ids. The ids enter on direction==True and exit on direction==False.
The resulting RDD should look like this:
time, valid_ids
(10, (4,))
(20, (4,5))
(34, (4,))
(67, (4,6))
(78, (4,))
(99, ())
I know that this will not parallelize, but the df is not that big. So how could this be done in Spark/Scala?
If the data is small ("but the df is not that big"), I'd just collect it and process it using Scala collections. If the types are as shown below:
df.printSchema
root
|-- time: integer (nullable = false)
|-- id: integer (nullable = false)
|-- direction: boolean (nullable = false)
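you can collect:
// Assumes the SparkSession is named spark; spark-shell already has this import in scope.
import spark.implicits._ // provides the encoder needed by .as[(Int, Int, Boolean)]
val data = df.as[(Int, Int, Boolean)].collect.toSeq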
and scanLeft:
val result = data.scanLeft((-1, Set[Int]())){
case ((_, acc), (time, value, true)) => (time, acc + value)
case ((_, acc), (time, value, false)) => (time, acc - value)
}.tail
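To see what the scanLeft step does without starting Spark, here is a minimal sketch on a plain Scala Seq holding the sample rows from the question. The object name ScanLeftSketch and the helper validIds are hypothetical names used only for illustration; the pattern match itself is the one from the answer above.

```scala
object ScanLeftSketch {
  // Sample rows from the question: (time, id, direction)
  val data: Seq[(Int, Int, Boolean)] = Seq(
    (10, 4, true), (20, 5, true), (34, 5, false),
    (67, 6, true), (78, 6, false), (99, 4, false)
  )

  // Hypothetical helper: pairs each time with the set of ids currently "inside".
  def validIds(rows: Seq[(Int, Int, Boolean)]): Seq[(Int, Set[Int])] =
    rows.scanLeft((-1, Set.empty[Int])) {
      case ((_, acc), (time, id, true))  => (time, acc + id) // id enters
      case ((_, acc), (time, id, false)) => (time, acc - id) // id leaves
    }.tail // drop the (-1, Set()) seed that scanLeft puts first

  def main(args: Array[String]): Unit =
    validIds(data).foreach(println)
}
```

Because the accumulator is a Set, removing an id does not depend on the order in which ids entered, so id 4 stays valid while 5 and 6 come and go, as in the table in the question.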
Using var is not recommended for Scala developers, but I am still posting an answer that uses var:
var collectArray = Array.empty[Int]
df.rdd.collect().map(row => {
  val id = row(1).asInstanceOf[Int]
  // direction is true: the id enters; otherwise remove that specific id.
  // (drop(1) would remove the oldest id instead of the one that is leaving.)
  if (row(2).toString.equalsIgnoreCase("true")) collectArray = collectArray :+ id
  else collectArray = collectArray.filterNot(_ == id)
  (row(0), collectArray.toList)
})
This should give you the following result:
(10,List(4))
(20,List(4, 5))
(34,List(4))
(67,List(4, 6))
(78,List(4))
(99,List())
Suppose the name of the respective data frame is someDF, then do:
val df1 = someDF.rdd.collect.iterator
while (df1.hasNext) {
  println(df1.next)
}
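If holding the whole collected array on the driver is a concern, Dataset.toLocalIterator() is an alternative that brings rows to the driver one partition at a time, in partition order. The following is a minimal sketch, assuming a local SparkSession and a small stand-in for someDF built from the first rows of the question; the object name, the app name, and the countAndPrint helper are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LocalIteratorSketch {
  // Hypothetical helper: prints every row and returns how many rows were seen.
  def countAndPrint(df: DataFrame): Int = {
    val it = df.toLocalIterator() // a java.util.Iterator[Row]
    var n = 0
    while (it.hasNext) {
      println(it.next())
      n += 1
    }
    n
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("loop-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for someDF, using sample rows from the question.
    val someDF = Seq((10, 4, true), (20, 5, true), (34, 5, false))
      .toDF("time", "id", "direction")

    countAndPrint(someDF)
    spark.stop()
  }
}
```

For a data frame as small as the one in the question, collect is simpler; this variant only matters when the data would not fit comfortably in driver memory all at once.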