How to calculate a unique session id per user per minute in a clickstream dataset using Spark-SQL?

Question

Imagine we have a clickstream dataset with millions of rows. We want to calculate a unique session ID per user per minute. Sample dataset:

+------+-------------------+
|userId|clicktime          |
+------+-------------------+
|1039  |2009-04-21 13:17:50|
|1039  |2009-04-21 13:17:59|
|1039  |2009-04-21 13:19:59|
|1038  |2009-05-21 13:17:50|
|1037  |2009-05-21 13:17:00|
|1037  |2009-05-21 13:17:50|
|1037  |2009-05-21 13:17:59|
|1037  |2009-05-21 13:19:59|
|1038  |2009-05-21 13:19:59|
|1039  |2009-04-21 13:20:50|
+------+-------------------+

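I have written code in Spark-Scala to solve this, but it is not an optimal solution for a dataset with millions of rows. I would like a better solution than the one I have implemented. Here is the source code of my implementation: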

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

// ClickStream is a case class defined elsewhere (not shown in the question).

// lag_diff = seconds since the same user's previous click; 0 for a user's first click.
val dfWithLag = rawData
  .withColumn("lag", lag(col("clicktime"), 1)
    .over(Window.partitionBy("userId").orderBy("clicktime")).cast("timestamp"))
  .withColumn("lag_diff", unix_timestamp($"clicktime") - unix_timestamp($"lag"))
  .withColumn("lag_diff", when(col("lag_diff").isNull, 0).otherwise(col("lag_diff")))
  .orderBy("userId", "clicktime")

// Walk each partition, drawing a new random id whenever the gap exceeds 60 seconds.
// Row columns: 0 = userId, 1 = clicktime, 2 = lag, 3 = lag_diff.
val finalDf = dfWithLag.repartition(col("userId")).mapPartitions(partition => {
  val sessionId = scala.util.Random
  var currentSessionId = sessionId.nextInt()
  val newPartition = partition.map(record => {
    val timeDiff = record.getLong(3)
    if (timeDiff > 60) currentSessionId = sessionId.nextInt()
    ClickStream(record.getInt(0), record.getTimestamp(1), record.getTimestamp(2),
      timeDiff, currentSessionId)
  }).toList
  newPartition.iterator
})(Encoders.product[ClickStream])

rawData.show(false)
finalDf.drop("lag").drop("lagDiff").show(false)

Output of the code:

+------+-------------------+-----------+
|userId|clickTime          |sessionId  |
+------+-------------------+-----------+
|1037  |2009-05-21 13:17:00|1049786501 |
|1037  |2009-05-21 13:17:50|1049786501 |
|1037  |2009-05-21 13:17:59|1049786501 |
|1037  |2009-05-21 13:19:59|-1649908351|
|1039  |2009-04-21 13:17:50|-1794290301|
|1039  |2009-04-21 13:17:59|-1794290301|
|1039  |2009-04-21 13:19:59|668855070  |
|1039  |2009-04-21 13:20:50|668855070  |
|1038  |2009-05-21 13:17:50|1149727960 |
|1038  |2009-05-21 13:19:59|-95969967  |
+------+-------------------+-----------+
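
For reference, the rule the code above applies (a gap of more than 60 seconds between a user's consecutive clicks starts a new session) can be restated on plain Scala collections. This is only a sketch for checking results on small samples, not something to run on millions of rows. The names GapRuleReference and sessionize, and the per-user counter used in place of random ids, are illustrative assumptions rather than part of the original code.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object GapRuleReference {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Returns (userId, clicktime, sessionSeq); sessionSeq counts from 0 per user.
  def sessionize(clicks: Seq[(Int, String)],
                 maxGapSeconds: Long = 60L): Seq[(Int, String, Int)] =
    clicks
      .groupBy(_._1)
      .toSeq
      .sortBy(_._1)
      .flatMap { case (userId, rows) =>
        // This timestamp format sorts chronologically as plain text.
        val times = rows.map(_._2).sorted
        // Bump the counter whenever the gap to the previous click exceeds the limit.
        val seqs = times.zip(times.drop(1)).scanLeft(0) { case (n, (prev, cur)) =>
          val gap = ChronoUnit.SECONDS.between(
            LocalDateTime.parse(prev, fmt), LocalDateTime.parse(cur, fmt))
          if (gap > maxGapSeconds) n + 1 else n
        }
        times.zip(seqs).map { case (t, s) => (userId, t, s) }
      }
}
```

On the sample dataset this groups the rows the same way as the output above: user 1037's first three clicks share one session, and user 1039's last two clicks (51 seconds apart) share another.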

Answer 1

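You can simply use date_format to get the timestamp without the seconds, then use the hash function to create your unique sessionId. In your example you use scala.util.Random without a seed, so your sessionId values may not be unique.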

import org.apache.spark.sql.functions.{col, date_format, hash}
df.withColumn("sessionId", hash(col("userId"), date_format(col("clicktime"), "yyyy-MM-dd HH:mm"))).show()

    +------+-------------------+
    |userId|          clicktime|
    +------+-------------------+
    |  1039|2009-04-21 13:17:50|
    |  1039|2009-04-21 13:17:59|
    |  1039|2009-04-21 13:19:59|
    |  1038|2009-05-21 13:17:50|
    |  1037|2009-05-21 13:17:00|
    |  1037|2009-05-21 13:17:50|
    |  1037|2009-05-21 13:17:59|
    |  1037|2009-05-21 13:19:59|
    |  1038|2009-05-21 13:19:59|
    |  1039|2009-04-21 13:20:50|
    +------+-------------------+

    +------+-------------------+-----------+
    |userId|          clicktime|  sessionId|
    +------+-------------------+-----------+
    |  1039|2009-04-21 13:17:50|-1768577078|
    |  1039|2009-04-21 13:17:59|-1768577078|
    |  1039|2009-04-21 13:19:59| -443001140|
    |  1038|2009-05-21 13:17:50| 1660590339|
    |  1037|2009-05-21 13:17:00| 1360561347|
    |  1037|2009-05-21 13:17:50| 1360561347|
    |  1037|2009-05-21 13:17:59| 1360561347|
    |  1037|2009-05-21 13:19:59|  925508976|
    |  1038|2009-05-21 13:19:59| 1148270137|
    |  1039|2009-04-21 13:20:50| 1342597130|
    +------+-------------------+-----------+
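
Note that hashing the minute-truncated timestamp groups clicks by calendar minute, which is not the same as the 60-second gap rule in the question's code: user 1039's clicks at 13:19:59 and 13:20:50 are 51 seconds apart and share a session in the question's output, but get different ids above. If the gap rule is what is wanted, one possible sketch uses only window functions (lag plus a running sum) and avoids mapPartitions. The names GapSessions and withSessionId, and the "userId-counter" id format, are assumptions made for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object GapSessions {
  // Assumed rule, taken from the question's code: a gap of more than
  // maxGapSeconds between a user's consecutive clicks starts a new session.
  def withSessionId(df: DataFrame, maxGapSeconds: Long = 60L): DataFrame = {
    val byUser = Window.partitionBy("userId").orderBy("clicktime")
    val ts = unix_timestamp(col("clicktime"))
    df
      // Seconds since the previous click of the same user (null for the first click).
      .withColumn("gap", ts - lag(ts, 1).over(byUser))
      // 1 marks the first click of a new session; a null gap falls through to 0.
      .withColumn("newSession", when(col("gap") > maxGapSeconds, 1).otherwise(0))
      // Running sum of the markers gives a per-user session counter.
      .withColumn("sessionSeq", sum(col("newSession")).over(byUser))
      .withColumn("sessionId",
        concat_ws("-", col("userId").cast("string"), col("sessionSeq").cast("string")))
      .drop("gap", "newSession", "sessionSeq")
  }
}
```

Calling GapSessions.withSessionId(df) on the sample data should give ids that are deterministic across runs and unique per user by construction, at the cost of one shuffle for the window partitioning.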

Solution 1
0 2019-10-24 08:35:44
