Spark-SQL Window functions on Dataframe - Finding first timestamp in a group

I have the dataframe below (say UserData).

uid region  timestamp
a   1   1
a   1   2
a   1   3
a   1   4
a   2   5
a   2   6
a   2   7
a   3   8
a   4   9
a   4   10
a   4   11
a   4   12
a   1   13
a   1   14
a   3   15
a   3   16
a   5   17
a   5   18
a   5   19
a   5   20

This data is nothing but a user (uid) travelling across different regions (region) at different times (timestamp). Presently, timestamp is shown as 'int' for simplicity. Note that the above dataframe will not necessarily be in increasing order of timestamp, and there may be rows from other users in between. I have shown the dataframe for a single user only, in monotonically increasing order of timestamp, for simplicity.
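For reference, here is a minimal sketch that builds this example dataframe (an assumption for reproducibility: a spark-shell session with a SparkSession named spark):

import spark.implicits._

val UserData = Seq(
  ("a", 1, 1), ("a", 1, 2), ("a", 1, 3), ("a", 1, 4),
  ("a", 2, 5), ("a", 2, 6), ("a", 2, 7),
  ("a", 3, 8),
  ("a", 4, 9), ("a", 4, 10), ("a", 4, 11), ("a", 4, 12),
  ("a", 1, 13), ("a", 1, 14),
  ("a", 3, 15), ("a", 3, 16),
  ("a", 5, 17), ("a", 5, 18), ("a", 5, 19), ("a", 5, 20)
).toDF("uid", "region", "timestamp")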

My goal is to find how much time user 'a' spent in each region, and in what order. So my final expected output looks like:

uid region  regionTimeStart regionTimeEnd
a   1   1   5
a   2   5   8
a   3   8   9
a   4   9   13
a   1   13  15
a   3   15  17
a   5   17  20

Based on my findings, Spark SQL Window functions can be used for this purpose. I have tried the following:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val w = Window
  .partitionBy("region")
  .partitionBy("uid")
  .orderBy("timestamp")

val resultDF = UserData.select(
  UserData("uid"), UserData("timestamp"),
  UserData("region"), rank().over(w).as("Rank"))

But from here onwards, I am not sure how to get the regionTimeStart and regionTimeEnd columns. The regionTimeEnd column is nothing but the 'lead' of regionTimeStart, except for the last entry in each group.

I see that aggregate operations have 'first' and 'last' functions, but for that I would need to group the data by ('uid', 'region'), which spoils the monotonically increasing order of the path traversed: at times 13 and 14 the user has come back to region '1', and I want that visit retained instead of clubbed with the initial visit to region '1' at time 1.
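For example, a naive aggregation like the sketch below (hypothetical, not code I intend to keep) would merge both visits to region '1' into a single row:

import org.apache.spark.sql.functions.{min, max}

// Both visits to region 1 (timestamps 1-4 and 13-14) collapse into one
// row with regionTimeStart=1 and regionTimeEnd=14, losing the travel order.
val naive = UserData.groupBy("uid", "region")
  .agg(min("timestamp").as("regionTimeStart"),
       max("timestamp").as("regionTimeEnd"))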

It would be very helpful if anyone could guide me. I am new to Spark, and I have a better understanding of the Scala Spark APIs than of the Python/Java Spark APIs.

Window functions are indeed useful, although your approach can work only if you assume that a user visits a given region only once. Also, the window definition you use is incorrect: multiple calls to partitionBy simply return new objects with different window definitions. If you want to partition by multiple columns, you should pass them in a single call ( .partitionBy("region", "uid") ).
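To make the difference concrete (a small sketch; each call to WindowSpec.partitionBy replaces the previous partition spec):

import org.apache.spark.sql.expressions.Window

// Chained calls: only the last partitionBy takes effect,
// so this window is partitioned by "uid" alone.
val chained = Window.partitionBy("region").partitionBy("uid").orderBy("timestamp")

// Correct: pass all partitioning columns in a single call.
val combined = Window.partitionBy("region", "uid").orderBy("timestamp")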

Let's start by marking continuous visits in each region:

import org.apache.spark.sql.functions.{lag, sum, not}
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // enables the $"..." column syntax

val w = Window.partitionBy($"uid").orderBy($"timestamp")

// 1 whenever the region differs from the previous row, i.e. a new visit starts
val change = (not(lag($"region", 1).over(w) <=> $"region")).cast("int")

// running sum of the change flags gives each contiguous visit a distinct id
val ind = sum(change).over(w)

val dfWithInd = df.withColumn("ind", ind)  // df is the UserData frame from the question
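A note on the expression above: <=> is Spark's null-safe equality operator. On the first row of each partition, lag returns null, and null <=> region evaluates to false rather than null, so not(...).cast("int") still yields a clean 1 at the start of every visit. On the sample data, the running sum should therefore number the seven contiguous visits 1 through 7 (a hand-derived sketch of the intermediate result, not captured output):

dfWithInd.orderBy($"timestamp").show

// +---+------+---------+---+
// |uid|region|timestamp|ind|
// +---+------+---------+---+
// |  a|     1|        1|  1|
// |  a|     1|        2|  1|
// |  a|     1|        3|  1|
// |  a|     1|        4|  1|
// |  a|     2|        5|  2|
// | ...                    |
// |  a|     1|       13|  5|
// |  a|     1|       14|  5|
// | ...                    |
// |  a|     5|       20|  7|
// +---+------+---------+---+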

Next, we simply aggregate over the groups and find the leads:

import org.apache.spark.sql.functions.{lead, coalesce, min, max}

// regionTimeEnd is the start of the next visit; for the last visit
// we fall back to its own maximum timestamp
val regionTimeEnd = coalesce(lead($"timestamp", 1).over(w), $"max_")

val result = dfWithInd
  .groupBy($"uid", $"region", $"ind")
  .agg(min($"timestamp").alias("timestamp"), max($"timestamp").alias("max_"))
  .drop("ind")
  .withColumn("regionTimeEnd", regionTimeEnd)
  .withColumnRenamed("timestamp", "regionTimeStart")
  .drop("max_")

result.show

// +---+------+---------------+-------------+
// |uid|region|regionTimeStart|regionTimeEnd|
// +---+------+---------------+-------------+
// |  a|     1|              1|            5|
// |  a|     2|              5|            8|
// |  a|     3|              8|            9|
// |  a|     4|              9|           13|
// |  a|     1|             13|           15|
// |  a|     3|             15|           17|
// |  a|     5|             17|           20|
// +---+------+---------------+-------------+
