
PySpark: count how many times an item occurs on different dates in a dataframe

Suppose I have a dataframe like this:

date         offer        member
2020-01-01    o1           m1
2020-01-01    o2           m1
2020-01-01    o1           m2
2020-01-01    o2           m2
2020-01-02    o1           m3
2020-01-02    o2           m3
2020-01-03    o1           m4

I need to count how many distinct dates each offer appears on and attach that count to every row (o1 appears on three dates, o2 on two):

date         offer        member    count
2020-01-01    o1           m1       3
2020-01-01    o2           m1       2
2020-01-01    o1           m2       3
2020-01-01    o2           m2       2
2020-01-02    o1           m3       3
2020-01-02    o2           m3       2
2020-01-03    o1           m4       3

Can someone help me with how to do this in PySpark? I am new to it.

  // Requires an active SparkSession named `spark`; the implicits enable .toDF and the 'symbol column syntax.
  import spark.implicits._

  // Sample data from the question.
  val source1DF = Seq(
    ("2020-01-01", "o1", "m1"),
    ("2020-01-01", "o2", "m1"),
    ("2020-01-01", "o1", "m2"),
    ("2020-01-01", "o2", "m2"),
    ("2020-01-02", "o1", "m3"),
    ("2020-01-02", "o2", "m3"),
    ("2020-01-03", "o1", "m4")
  ).toDF("date", "offer", "member")

  // Distinct (date, offer) pairs, then count how many dates each offer appears on.
  // groupBy(...).count() already produces a column named "count".
  val tmp1DF = source1DF.select('date, 'offer).dropDuplicates()
  val tmp2DF = tmp1DF.groupBy("offer").count()

  // Join the per-offer day count back onto every original row.
  val resultDF = source1DF
    .join(tmp2DF, source1DF.col("offer") === tmp2DF.col("offer"))
    .select(
      source1DF.col("date"),
      source1DF.col("offer"),
      source1DF.col("member"),
      tmp2DF.col("count")
    )

  resultDF.show(false)
  //  +----------+-----+------+-----+
  //  |date      |offer|member|count|
  //  +----------+-----+------+-----+
  //  |2020-01-01|o1   |m1    |3    |
  //  |2020-01-01|o2   |m1    |2    |
  //  |2020-01-01|o1   |m2    |3    |
  //  |2020-01-01|o2   |m2    |2    |
  //  |2020-01-02|o1   |m3    |3    |
  //  |2020-01-02|o2   |m3    |2    |
  //  |2020-01-03|o1   |m4    |3    |
  //  +----------+-----+------+-----+
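
The answer above is in Scala, while the question asks for PySpark. Below is a rough PySpark translation of the same approach, a minimal untested sketch: the names source_df and days_per_offer are mine, and it assumes an active SparkSession.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Sample data from the question.
  source_df = spark.createDataFrame(
      [
          ("2020-01-01", "o1", "m1"),
          ("2020-01-01", "o2", "m1"),
          ("2020-01-01", "o1", "m2"),
          ("2020-01-01", "o2", "m2"),
          ("2020-01-02", "o1", "m3"),
          ("2020-01-02", "o2", "m3"),
          ("2020-01-03", "o1", "m4"),
      ],
      ["date", "offer", "member"],
  )

  # Distinct (date, offer) pairs, then count how many dates each offer appears on.
  # The aggregated column produced by count() is already named "count".
  days_per_offer = (
      source_df.select("date", "offer")
      .dropDuplicates()
      .groupBy("offer")
      .count()
  )

  # Join the per-offer day count back onto every original row.
  result_df = (
      source_df.join(days_per_offer, on="offer", how="left")
      .select("date", "offer", "member", "count")
  )

  result_df.show(truncate=False)

As a design note, the dropDuplicates step could also be replaced by a direct aggregation such as source_df.groupBy("offer").agg(F.countDistinct("date").alias("count")) (with import pyspark.sql.functions as F), joined back to source_df in the same way.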
