How to calculate unique session id per user per minute in a click stream dataset using Spark-SQL?
Imagine we have a clickstream dataset with millions of rows. We want to compute a unique session ID per user per minute. Sample dataset:
+------+-------------------+
|userId|clicktime          |
+------+-------------------+
|1039  |2009-04-21 13:17:50|
|1039  |2009-04-21 13:17:59|
|1039  |2009-04-21 13:19:59|
|1038  |2009-05-21 13:17:50|
|1037  |2009-05-21 13:17:00|
|1037  |2009-05-21 13:17:50|
|1037  |2009-05-21 13:17:59|
|1037  |2009-05-21 13:19:59|
|1038  |2009-05-21 13:19:59|
|1039  |2009-04-21 13:20:50|
+------+-------------------+
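I have written code in Spark-Scala to solve this, but it is not the best solution for a dataset with millions of rows. I would like a better solution than the one I have implemented. Below is the source code of my implementation: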
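import org.apache.spark.sql.Encoders
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, unix_timestamp, when}
import spark.implicits._  // for the $"..." column syntax

// Assumed shape of ClickStream (the case class is not shown in the original post).
case class ClickStream(userId: Int, clickTime: java.sql.Timestamp, lag: java.sql.Timestamp,
                       lagDiff: Long, sessionId: Int)

// Seconds elapsed since the same user's previous click (0 for the user's first click).
val dfWithLag = rawData
  .withColumn("lag", lag(col("clicktime"), 1)
    .over(Window.partitionBy("userId").orderBy("clicktime")).cast("timestamp"))
  .withColumn("lag_diff", unix_timestamp($"clicktime") - unix_timestamp($"lag"))
  .withColumn("lag_diff", when(col("lag_diff").isNull, 0).otherwise(col("lag_diff")))
  .orderBy("userId", "clicktime")

val finalDf = dfWithLag.repartition(col("userId")).mapPartitions(partition => {
  val sessionId = scala.util.Random
  var currentSessionId = sessionId.nextInt()
  val newPartition = partition
    .map(record => {
      ClickStream(record.getInt(0), record.getTimestamp(1), record.getTimestamp(2),
        record.getLong(3), {
          val timeDiff = record.getLong(3)
          // A gap of more than 60 seconds starts a new session.
          if (timeDiff > 60) {
            currentSessionId = sessionId.nextInt()
          }
          currentSessionId
        })
    }).toList
  newPartition.iterator
})(Encoders.product[ClickStream])

rawData.show(false)
finalDf.drop("lag").drop("lagDiff").show(false)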
Output of the code:
+------+-------------------+-----------+
|userId|clickTime          |sessionId  |
+------+-------------------+-----------+
|1037  |2009-05-21 13:17:00|1049786501 |
|1037  |2009-05-21 13:17:50|1049786501 |
|1037  |2009-05-21 13:17:59|1049786501 |
|1037  |2009-05-21 13:19:59|-1649908351|
|1039  |2009-04-21 13:17:50|-1794290301|
|1039  |2009-04-21 13:17:59|-1794290301|
|1039  |2009-04-21 13:19:59|668855070  |
|1039  |2009-04-21 13:20:50|668855070  |
|1038  |2009-05-21 13:17:50|1149727960 |
|1038  |2009-05-21 13:19:59|-95969967  |
+------+-------------------+-----------+
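You can use date_format to get the timestamp without the seconds, and then build your unique sessionId with the hash function. In your example you use scala.util.Random without a seed, so your sessionId values may not be unique.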
df.show()
df.withColumn("sessionId", hash('userId, date_format('clicktime, "yyyy-MM-dd HH:mm"))).show()
+------+-------------------+
|userId| clicktime|
+------+-------------------+
| 1039|2009-04-21 13:17:50|
| 1039|2009-04-21 13:17:59|
| 1039|2009-04-21 13:19:59|
| 1038|2009-05-21 13:17:50|
| 1037|2009-05-21 13:17:00|
| 1037|2009-05-21 13:17:50|
| 1037|2009-05-21 13:17:59|
| 1037|2009-05-21 13:19:59|
| 1038|2009-05-21 13:19:59|
| 1039|2009-04-21 13:20:50|
+------+-------------------+
+------+-------------------+-----------+
|userId| clicktime| sessionId|
+------+-------------------+-----------+
| 1039|2009-04-21 13:17:50|-1768577078|
| 1039|2009-04-21 13:17:59|-1768577078|
| 1039|2009-04-21 13:19:59| -443001140|
| 1038|2009-05-21 13:17:50| 1660590339|
| 1037|2009-05-21 13:17:00| 1360561347|
| 1037|2009-05-21 13:17:50| 1360561347|
| 1037|2009-05-21 13:17:59| 1360561347|
| 1037|2009-05-21 13:19:59| 925508976|
| 1038|2009-05-21 13:19:59| 1148270137|
| 1039|2009-04-21 13:20:50| 1342597130|
+------+-------------------+-----------+
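Note that hashing userId with the truncated minute assigns one ID per calendar minute, while the code in the question starts a new session whenever the gap between two consecutive clicks of a user exceeds 60 seconds. If the gap-based behaviour is what you want, it can probably be expressed with window functions alone, without mapPartitions or random numbers. The following is only a sketch under assumptions: the names byUser, prev, isNewSession and sessionIdx are made up for illustration, the local SparkSession exists only so the snippet is self-contained, and the 60-second threshold is taken from the question's code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat_ws, lag, sum, to_timestamp, unix_timestamp, when}

// Local session for illustration only; in spark-shell or a real job `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sessionize-sketch").getOrCreate()
import spark.implicits._

// The sample rows from the question.
val clicks = Seq(
  (1039, "2009-04-21 13:17:50"), (1039, "2009-04-21 13:17:59"),
  (1039, "2009-04-21 13:19:59"), (1038, "2009-05-21 13:17:50"),
  (1037, "2009-05-21 13:17:00"), (1037, "2009-05-21 13:17:50"),
  (1037, "2009-05-21 13:17:59"), (1037, "2009-05-21 13:19:59"),
  (1038, "2009-05-21 13:19:59"), (1039, "2009-04-21 13:20:50")
).toDF("userId", "clicktime")
  .withColumn("clicktime", to_timestamp(col("clicktime")))

// Hypothetical helper name: one window per user, ordered by click time.
val byUser = Window.partitionBy("userId").orderBy("clicktime")

val sessions = clicks
  .withColumn("prev", lag(col("clicktime"), 1).over(byUser))
  // 1 when the gap to the previous click exceeds 60 seconds, otherwise 0.
  // For a user's first click `prev` is null, the comparison is null, and the flag is 0.
  .withColumn("isNewSession",
    when(unix_timestamp(col("clicktime")) - unix_timestamp(col("prev")) > 60, 1).otherwise(0))
  // The running total of the flags is a per-user session index (0, 1, 2, ...).
  .withColumn("sessionIdx", sum(col("isNewSession")).over(byUser))
  // Combining userId and index gives an ID that is deterministic and unique across users.
  .withColumn("sessionId",
    concat_ws("-", col("userId").cast("string"), col("sessionIdx").cast("string")))
  .drop("prev", "isNewSession", "sessionIdx")

sessions.orderBy("userId", "clicktime").show(false)
```

On the sample data this groups the clicks the same way as the output in the question (for instance, the last two clicks of user 1039 share a session), whereas the per-minute hash above gives those two clicks different IDs. Because everything stays in DataFrame operations, no partition has to be collected into a List on an executor.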