
Spark Scala: group by one column, breaking another column into a list


|  user | music | listen_time         |
|   A   |   m   | 2019-07-01 16:00:00 |
|   A   |   n   | 2019-07-01 16:05:00 |
|   A   |   x   | 2019-07-01 16:10:00 |
|   A   |   y   | 2019-07-01 17:10:00 |
|   A   |   z   | 2019-07-02 18:10:00 |
|   A   |   m   | 2019-07-02 18:15:00 |
|   B   |   t   | 2019-07-02 18:15:00 |
|   B   |   s   | 2019-07-02 18:20:00 |


|  user | music_list |
|   A   |   m, n, x  |
|   A   |      y     |
|   A   |    z, m    |
|   B   |    t, s    |

How can I achieve this with a Spark DataFrame in Scala?


df.groupBy($"user", window($"listen_time", "30 minutes")).agg(collect_list($"music"))


|user|window                                    |collect_list(music)|
|A   |[2019-07-01 16:00:00, 2019-07-01 16:30:00]|[m, n, x]          |
|B   |[2019-07-02 18:00:00, 2019-07-02 18:30:00]|[t, s]             |
|A   |[2019-07-02 18:00:00, 2019-07-02 18:30:00]|[z, m]             |
|A   |[2019-07-01 17:00:00, 2019-07-01 17:30:00]|[y]                |

The result is similar but not exactly the same. Apply concat_ws after collect_list to get m, n, x.
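As a minimal sketch of the concat_ws suggestion, assuming a local SparkSession and the column names from the question (the `plays` and `musicLists` names, the `music_list` alias, the `local[*]` master, and the UTC session time zone are illustrative choices, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws, to_timestamp, window}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("music-windows")
  .config("spark.sql.session.timeZone", "UTC") // keeps bucket boundaries predictable
  .getOrCreate()
import spark.implicits._

val plays = Seq(
  ("A", "m", "2019-07-01 16:00:00"),
  ("A", "n", "2019-07-01 16:05:00"),
  ("A", "x", "2019-07-01 16:10:00"),
  ("A", "y", "2019-07-01 17:10:00"),
  ("A", "z", "2019-07-02 18:10:00"),
  ("A", "m", "2019-07-02 18:15:00"),
  ("B", "t", "2019-07-02 18:15:00"),
  ("B", "s", "2019-07-02 18:20:00")
).toDF("user", "music", "listen_time")
  .withColumn("listen_time", to_timestamp($"listen_time"))

// Group by user and fixed 30-minute bucket, then join each list into one string
val musicLists = plays
  .groupBy($"user", window($"listen_time", "30 minutes"))
  .agg(concat_ws(", ", collect_list($"music")).as("music_list"))
  .select("user", "music_list")

musicLists.show(false)
```

Keep in mind that `window` produces fixed buckets, so two plays only a few minutes apart can still end up in different rows when they straddle a bucket boundary. Also, collect_list does not promise any particular element order after a shuffle.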


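// Assumes a SparkSession named spark is in scope (as in spark-shell)
import org.apache.spark.sql.functions.{col, collect_list, udf, unix_timestamp}
import spark.implicits._

val data = Seq(("A", "m", "2019-07-01 16:00:00"),
  ("A", "n", "2019-07-01 16:05:00"),
  ("A", "x", "2019-07-01 16:10:00"),
  ("A", "y", "2019-07-01 17:10:00"),
  ("A", "z", "2019-07-02 18:10:00"),
  ("A", "m", "2019-07-02 18:15:00"),
  ("B", "t", "2019-07-02 18:15:00"),
  ("B", "s", "2019-07-02 18:20:00"))

// Round a Unix timestamp (in seconds) down to the start of its 30-minute bucket
val getinterval = udf((time: Long) => (time / 1800) * 1800)

val df = data.toDF("user", "music", "listen")
  .withColumn("unixtime", unix_timestamp(col("listen")))
  .withColumn("interval", getinterval(col("unixtime")))

// The original snippet stops at groupBy; the aggregation below is an assumed completion
val res = df.groupBy(col("user"), col("interval"))
  .agg(collect_list(col("music")).as("music_list"))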



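  • Create a "newSession" column holding the literal 1 whenever a row starts a new session (if I understand correctly, that means more than 30 minutes without listening to music).
  • Create the session ID by simply summing those literal 1s.
  • Group by the newly created session ID and the user.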
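The three steps above can be sanity-checked on plain Scala collections, without a Spark session. This is only an illustrative sketch: the `Play` case class, the `toEpochSeconds` and `sessionize` helpers, and the `gapSeconds` default are hypothetical names introduced here, not part of the original answer.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical record type for one listening event
final case class Play(user: String, music: String, listen: String)

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Treats the timestamps as UTC; only differences between them matter here
def toEpochSeconds(s: String): Long =
  LocalDateTime.parse(s, fmt).toEpochSecond(ZoneOffset.UTC)

// Hypothetical helper: split each user's plays into sessions separated by more than gapSeconds
def sessionize(plays: Seq[Play], gapSeconds: Long = 30 * 60): Seq[(String, List[String])] =
  plays.groupBy(_.user).toSeq.sortBy(_._1).flatMap { case (user, ps) =>
    val sorted = ps.sortBy(p => toEpochSeconds(p.listen))
    val times = sorted.map(p => toEpochSeconds(p.listen))
    // Step 1: flag 1 when a play starts a new session (first play, or gap > gapSeconds)
    val newSession = times.indices.map { i =>
      if (i == 0 || times(i) - times(i - 1) > gapSeconds) 1 else 0
    }
    // Step 2: a running sum of the flags gives a per-user session id
    val sessionIds = newSession.scanLeft(0)(_ + _).tail
    // Step 3: group by session id and collect the music in listening order
    sorted.zip(sessionIds)
      .groupBy(_._2).toSeq.sortBy(_._1)
      .map { case (_, rows) => (user, rows.map(_._1.music).toList) }
  }

val plays = Seq(
  Play("A", "m", "2019-07-01 16:00:00"),
  Play("A", "n", "2019-07-01 16:05:00"),
  Play("A", "x", "2019-07-01 16:10:00"),
  Play("A", "y", "2019-07-01 17:10:00"),
  Play("A", "z", "2019-07-02 18:10:00"),
  Play("A", "m", "2019-07-02 18:15:00"),
  Play("B", "t", "2019-07-02 18:15:00"),
  Play("B", "s", "2019-07-02 18:20:00")
)

sessionize(plays).foreach { case (user, music) => println(s"$user: ${music.mkString(", ")}") }
// A: m, n, x
// A: y
// A: z, m
// B: t, s
```

The flag, running sum, and final grouping correspond to the lag comparison, the cumulative sum over a window, and the groupBy in a DataFrame version.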



// Assumes a SparkSession named spark is in scope (as in spark-shell)
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window
import spark.implicits._

// Create the data.
// We convert the timestamp to Unix time (seconds), which makes the 30-minute difference easier to check.
val df = Seq(("A", "m", "2019-07-01 16:00:00"),
  ("A", "n", "2019-07-01 16:05:00"),
  ("A", "x", "2019-07-01 16:10:00"),
  ("A", "y", "2019-07-01 17:10:00"),
  ("A", "z", "2019-07-02 18:10:00"),
  ("A", "m", "2019-07-02 18:15:00"),
  ("B", "t", "2019-07-02 18:15:00"),
  ("B", "s", "2019-07-02 18:20:00"))
  .toDF("user", "music", "listen")
  .withColumn("unix", F.unix_timestamp($"listen", "yyyy-MM-dd HH:mm:ss"))

// The window over which we lag to detect the start of a new session
val userSessionWindow = Window.partitionBy("user").orderBy("unix")

// Flags each row that starts a new session with 1. Adjust the condition to match your definition of a "new session".
val newSession = ($"unix" > F.lag($"unix", 1).over(userSessionWindow) + 30 * 60).cast("bigint")

// The first row of each user has no previous row, so its flag is null; fill it with 1
val dfWithNewSession = df.withColumn("newSession", newSession).na.fill(1, Seq("newSession"))

|user|music|             listen|      unix|newSession|
|   B|    t|2019-07-02 18:15:00|1562084100|         1|
|   B|    s|2019-07-02 18:20:00|1562084400|         0|
|   A|    m|2019-07-01 16:00:00|1561989600|         1|
|   A|    n|2019-07-01 16:05:00|1561989900|         0|
|   A|    x|2019-07-01 16:10:00|1561990200|         0|
|   A|    y|2019-07-01 17:10:00|1561993800|         1|
|   A|    z|2019-07-02 18:10:00|1562083800|         1|
|   A|    m|2019-07-02 18:15:00|1562084100|         0|

// To assign a session ID within each user, take a cumulative sum of that user's newSession flags
val dfWithSessionId = dfWithNewSession.withColumn("session", F.sum("newSession").over(userSessionWindow))

|user|music|             listen|      unix|newSession|session|
|   B|    t|2019-07-02 18:15:00|1562084100|         1|      1|
|   B|    s|2019-07-02 18:20:00|1562084400|         0|      1|
|   A|    m|2019-07-01 16:00:00|1561989600|         1|      1|
|   A|    n|2019-07-01 16:05:00|1561989900|         0|      1|
|   A|    x|2019-07-01 16:10:00|1561990200|         0|      1|
|   A|    y|2019-07-01 17:10:00|1561993800|         1|      2|
|   A|    z|2019-07-02 18:10:00|1562083800|         1|      3|
|   A|    m|2019-07-02 18:15:00|1562084100|         0|      3|

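// Keep the DataFrame in dfFinal; calling show directly in the assignment would store Unit instead
val dfFinal = dfWithSessionId
  .groupBy("user", "session")
  .agg(F.collect_list("music").as("music"))
  .select("user", "music")

dfFinal.show()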


|user|    music|
|   B|   [t, s]|
|   A|[m, n, x]|
|   A|      [y]|
|   A|   [z, m]|
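On Spark 3.2 or later, the built-in `session_window` function may express the same gap-based grouping without the manual lag and cumulative sum. The following is a sketch under that version assumption, reusing the sample rows; the `plays` and `sessions` names, the `listen_ts`, `ordered`, and `music_list` columns, the `local[*]` master, and the UTC time zone setting are illustrative choices rather than part of the original answer. Because collect_list does not promise element order after a shuffle, the sketch collects (time, music) structs and sorts them before joining.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws, session_window, sort_array, struct, to_timestamp}

// Assumption: Spark 3.2+ and a throwaway local session; in spark-shell, `spark` already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("music-sessions")
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()
import spark.implicits._

val plays = Seq(
  ("A", "m", "2019-07-01 16:00:00"),
  ("A", "n", "2019-07-01 16:05:00"),
  ("A", "x", "2019-07-01 16:10:00"),
  ("A", "y", "2019-07-01 17:10:00"),
  ("A", "z", "2019-07-02 18:10:00"),
  ("A", "m", "2019-07-02 18:15:00"),
  ("B", "t", "2019-07-02 18:15:00"),
  ("B", "s", "2019-07-02 18:20:00")
).toDF("user", "music", "listen")
  .withColumn("listen_ts", to_timestamp($"listen"))

val sessions = plays
  // A session stays open as long as the next play arrives within the 30-minute gap
  .groupBy($"user", session_window($"listen_ts", "30 minutes"))
  // Sort the (time, music) structs by time so the joined string follows listening order
  .agg(sort_array(collect_list(struct($"listen_ts", $"music"))).as("ordered"))
  .select($"user", concat_ws(", ", $"ordered.music").as("music_list"))

sessions.show(false)
```

The behaviour for a gap of exactly 30 minutes may differ from the strict `>` comparison used in the lag-based version, so it is worth checking boundary cases against your Spark version.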


