
How to create a sequence of events (column values) per some other column?

I have a Spark DataFrame as shown below:

val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")


scala> myDF.show
+-------+-------+---------+-------------+------+
|visitor|channel|timestamp|purchase_flag|amount|
+-------+-------+---------+-------------+------+
|      1|      A|      100|            0|     0|
|      1|      E|      200|            0|     0|
|      1|       |      300|            1|    49|
|      2|      A|      200|            0|     0|
|      2|      C|      300|            0|     0|
|      2|      D|      100|            0|     0|
+-------+-------+---------+-------------+------+

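For each visitor in myDF, I want to build a sequence that traces the visitor's path to purchase, ordered by the timestamp column. The output DataFrame should look like this (-> can be any delimiter):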

+-------+---------------------+
|visitor|channel sequence     |
+-------+---------------------+
|      1| A->E->purchase      |
|      2| D->A->C->no_purchase|
+-------+---------------------+

To clarify, visitor 2 visited channel D, then channel A, then channel C, and did not make a purchase. The sequence is therefore D->A->C->no_purchase.

Note: whenever a purchase happens, the channel value is blank and purchase_flag is set to 1.

I want to do this with a Scala UDF in Spark so that I can re-apply the method to other datasets.

Here is how you can do it with a udf function:

val myDF = Seq(
  (1,"A",100,0,0),
  (1,"E",200,0,0),
  (1,"",300,1,49),
  (2,"A",200,0,0),
  (2,"C",300,0,0),
  (2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")

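import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Sort each visitor's (channel, timestamp) pairs by timestamp, drop the blank
// channel of purchase rows, join the channels, and append the purchase outcome.
def sequenceUdf = udf((struct: Seq[Row], purchased: Seq[Int]) =>
  struct.map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2)
    .map(_._1)
    .filterNot(_ == "")
    .mkString("->") + (if (purchased.contains(1)) "->purchase" else "->no_purchase"))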

myDF.groupBy("visitor").agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
  .select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
  .show(false)

This should give you:

+-------+--------------------+
|visitor|channel sequence    |
+-------+--------------------+
|1      |A->E->purchase      |
|2      |D->A->C->no_purchase|
+-------+--------------------+
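
For comparison, here is a rough sketch of the same aggregation without a UDF, using only built-in functions. This is a swapped-in alternative, not the approach shown above. It assumes Spark 2.4 or later (for array_join and the SQL filter higher-order function), a spark-shell session in which myDF is already defined, and that purchase_flag only ever holds 0 or 1. The names sequences, events, and purchased are made up for illustration.

```scala
import org.apache.spark.sql.functions._

val sequences = myDF
  .groupBy("visitor")
  .agg(
    // Structs are compared field by field, left to right, so putting
    // timestamp first makes sort_array order the events by timestamp.
    sort_array(collect_list(struct(col("timestamp"), col("channel")))).as("events"),
    // Assumption: purchase_flag is 0 or 1, so max == 1 means "purchased".
    max(col("purchase_flag")).as("purchased"))
  .select(
    col("visitor"),
    concat(
      // events.channel extracts the channel field from every struct;
      // filter drops the blank channel recorded on purchase rows.
      array_join(expr("filter(events.channel, c -> c != '')"), "->"),
      when(col("purchased") === 1, "->purchase").otherwise("->no_purchase")
    ).as("channel sequence"))

sequences.show(false)
```

Keeping the work in built-in functions lets Spark optimize the whole plan, whereas a UDF is opaque to the optimizer. The UDF version may still be easier to reuse if the sequence logic grows more complicated.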

You can make it as generic as you like. This is just a demonstration of how to proceed.
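
As one possible way to generalize it, the body of the UDF can be pulled out into a plain Scala function whose delimiter and outcome labels are parameters. The sketch below has no Spark dependency, so it can be tried in any Scala REPL. ChannelSequence, buildSequence, delimiter, purchaseLabel, and noPurchaseLabel are hypothetical names, not part of the answer above.

```scala
// Hypothetical helper: the same logic as the UDF body, without Spark.
object ChannelSequence {
  def buildSequence(
      events: Seq[(String, Int)],  // (channel, timestamp) pairs for one visitor
      purchaseFlags: Seq[Int],     // purchase_flag values for the same visitor
      delimiter: String = "->",
      purchaseLabel: String = "purchase",
      noPurchaseLabel: String = "no_purchase"): String = {
    // Order by timestamp, keep the channel names, drop blank purchase rows.
    val channels = events.sortBy(_._2).map(_._1).filterNot(_.isEmpty)
    val outcome = if (purchaseFlags.contains(1)) purchaseLabel else noPurchaseLabel
    // Same shape as the UDF: joined channels, then delimiter, then outcome.
    channels.mkString(delimiter) + delimiter + outcome
  }
}

// Visitor 2 from the example data.
println(ChannelSequence.buildSequence(Seq(("A", 200), ("C", 300), ("D", 100)), Seq(0, 0, 0)))
// prints D->A->C->no_purchase
```

To use it from Spark, the UDF would only need to convert each Row into a (channel, timestamp) pair and delegate to buildSequence, so column names stay in one place and the logic can be unit tested on its own.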
