How to calculate cumulative sum over date range excluding weekends in PySpark 2.0?

This is an extension to an earlier question I raised here: How to calculate difference between dates excluding weekends in PySpark 2.2.0. My Spark dataframe looks like below and can be generated with the accompanying code:

df = spark.createDataFrame(
    [(1, "John Doe", "2020-11-30", 1), (2, "John Doe", "2020-11-27", 2),
     (4, "John Doe", "2020-12-01", 0), (5, "John Doe", "2020-10-02", 1),
     (6, "John Doe", "2020-12-03", 1), (7, "John Doe", "2020-12-04", 1)],
    ("id", "name", "date", "count"))

+---+--------+----------+-----+
| id|    name|      date|count|
+---+--------+----------+-----+
|  5|John Doe|2020-10-02|    1|
|  2|John Doe|2020-11-27|    2|
|  1|John Doe|2020-11-30|    1|
|  4|John Doe|2020-12-01|    0|
|  6|John Doe|2020-12-03|    1|
|  7|John Doe|2020-12-04|    1|
+---+--------+----------+-----+

I am trying to calculate cumulative sums over windows of 2, 3, 4, 5 and 30 days. Below is sample code for the 2-day window and the resulting table.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

days = lambda i: i * 86400  # convert a day offset to seconds for the range frame

windowval_2 = Window.partitionBy("name").orderBy(F.col("date").cast("timestamp").cast("long")).rangeBetween(days(-1), days(0))
windowval_3 = Window.partitionBy("name").orderBy(F.col("date").cast("timestamp").cast("long")).rangeBetween(days(-2), days(0))
windowval_4 = Window.partitionBy("name").orderBy(F.col("date").cast("timestamp").cast("long")).rangeBetween(days(-3), days(0))
df = df.withColumn("cum_sum_2d_temp", F.sum("count").over(windowval_2))


+---+--------+----------+-----+---------------+
| id|    name|      date|count|cum_sum_2d_temp|
+---+--------+----------+-----+---------------+
|  5|John Doe|2020-10-02|    1|              1|
|  2|John Doe|2020-11-27|    2|              2|
|  1|John Doe|2020-11-30|    1|              1|
|  4|John Doe|2020-12-01|    0|              1|
|  6|John Doe|2020-12-03|    1|              1|
|  7|John Doe|2020-12-04|    1|              2|
+---+--------+----------+-----+---------------+
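The 3- and 4-day windows defined above follow the same pattern; a minimal sketch of generating all of the requested spans in one loop (the loop and the column-name pattern are illustrative, not from the original code):

for n in [2, 3, 4, 5, 30]:
    # An n-day window covers the current day plus the previous n-1 days,
    # matching the days(-1) offset used for the 2-day window above.
    w = Window.partitionBy("name").orderBy(F.col("date").cast("timestamp").cast("long")) \
              .rangeBetween(days(-(n - 1)), days(0))
    df = df.withColumn("cum_sum_{}d_temp".format(n), F.sum("count").over(w))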

What I am trying to do is exclude weekends when calculating the date range: in my table 2020-11-27 is a Friday and 2020-11-30 is a Monday, so the difference between them is 1 if we exclude Saturday and Sunday. I want the 'cum_sum_2d_temp' value in front of 2020-11-30 to be the cumulative sum of the 2020-11-27 and 2020-11-30 counts, which should be 3. I am looking to combine the solution to my earlier question with the date-range window.
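For reference, the business-day arithmetic this relies on can be checked directly with NumPy; note that np.busday_count counts the half-open interval [start, end), so Friday to Monday yields 1:

import numpy as np

# Friday 2020-11-27 -> Monday 2020-11-30: only Friday is counted, because
# Sat/Sun are not business days and the end date itself is excluded.
print(np.busday_count("2020-11-27", "2020-11-30"))  # 1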

Calculate the date_dif relative to the earliest date:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

df = spark.createDataFrame(
    [(1, "John Doe", "2020-11-30", 1), (2, "John Doe", "2020-11-27", 2),
     (4, "John Doe", "2020-12-01", 0), (5, "John Doe", "2020-10-02", 1),
     (6, "John Doe", "2020-12-03", 1), (7, "John Doe", "2020-12-04", 1)],
    ("id", "name", "date", "count"))

# Business-day difference between each row's date and the earliest date for that name
workdaysUDF = F.udf(lambda date1, date2: int(np.busday_count(date2, date1)) if (date1 is not None and date2 is not None) else None, IntegerType())
df = df.withColumn("date_dif", workdaysUDF(F.col('date'), F.first(F.col('date')).over(Window.partitionBy('name').orderBy('date'))))

# Range the window over the business-day index instead of raw timestamps
windowval = lambda days: Window.partitionBy('name').orderBy('date_dif').rangeBetween(-days, 0)
df = df.withColumn("cum_sum", F.sum("count").over(windowval(2)))
df.show()

+---+--------+----------+-----+--------+-------+
| id|    name|      date|count|date_dif|cum_sum|
+---+--------+----------+-----+--------+-------+
|  5|John Doe|2020-10-02|    1|       0|      1|
|  2|John Doe|2020-11-27|    2|      40|      2|
|  1|John Doe|2020-11-30|    1|      41|      3|
|  4|John Doe|2020-12-01|    0|      42|      3|
|  6|John Doe|2020-12-03|    1|      44|      1|
|  7|John Doe|2020-12-04|    1|      45|      2|
+---+--------+----------+-----+--------+-------+
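Since date_dif is an ordinary integer column, the same windowval helper can be reused for the other requested spans; a minimal sketch, assuming a cum_sum_{n}d column-name pattern (not from the original answer):

for n in [2, 3, 4, 5, 30]:
    # rangeBetween(-n, 0) over date_dif sums rows within n business days.
    df = df.withColumn("cum_sum_{}d".format(n), F.sum("count").over(windowval(n)))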
