Pyspark count how many times an item occurs on different dates in a dataframe
Suppose I have a dataframe such as
date offer member
2020-01-01 o1 m1
2020-01-01 o2 m1
2020-01-01 o1 m2
2020-01-01 o2 m2
2020-01-02 o1 m3
2020-01-02 o2 m3
2020-01-03 o1 m4
I need to count how many days each offer appears on:
date offer member count
2020-01-01 o1 m1 3
2020-01-01 o2 m1 2
2020-01-01 o1 m2 3
2020-01-01 o2 m2 2
2020-01-02 o1 m3 3
2020-01-02 o2 m3 2
2020-01-03 o1 m4 3
Can someone help me with how to do this in PySpark, as I am new to it?
In Scala (e.g. in the `spark-shell`), one approach is to count the distinct dates per offer and join that count back onto the original rows:

import spark.implicits._ // for toDF and the 'col Symbol syntax

val source1DF = Seq(
  ("2020-01-01", "o1", "m1"),
  ("2020-01-01", "o2", "m1"),
  ("2020-01-01", "o1", "m2"),
  ("2020-01-01", "o2", "m2"),
  ("2020-01-02", "o1", "m3"),
  ("2020-01-02", "o2", "m3"),
  ("2020-01-03", "o1", "m4")
).toDF("date", "offer", "member")

// One row per (date, offer) pair, so each offer is counted once per day
val tmp1DF = source1DF.select('date, 'offer).dropDuplicates()

// Number of distinct dates each offer appears on;
// groupBy(...).count() already names the resulting column "count"
val tmp2DF = tmp1DF.groupBy("offer").count()
val resultDF = source1DF
.join(tmp2DF, source1DF.col("offer") === tmp2DF.col("offer"))
.select(
source1DF.col("date"),
source1DF.col("offer"),
source1DF.col("member"),
tmp2DF.col("count")
)
resultDF.show(false)
// +----------+-----+------+-----+
// |date |offer|member|count|
// +----------+-----+------+-----+
// |2020-01-01|o1 |m1 |3 |
// |2020-01-01|o2 |m1 |2 |
// |2020-01-01|o1 |m2 |3 |
// |2020-01-01|o2 |m2 |2 |
// |2020-01-02|o1 |m3 |3 |
// |2020-01-02|o2 |m3 |2 |
// |2020-01-03|o1 |m4 |3 |
// +----------+-----+------+-----+