
How do I use a dataframe's data to create an aggregated column and then expand rows using another dataframe in PySpark?

I have a data frame that gives me funding for various products at different levels. This is a wide data frame that shows funding from 2021-Jan-01 to 2021-Dec-31 (Funding_Start_Date and Funding_End_Date are in yyyyMMdd format).

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
from pyspark.sql.functions import col

funding_data = [
    (20210101,20211231,"Family","Cars","Audi","A4", 420.0, 12345, "Lump_Sum", 50000)
  ]

funding_schema = StructType([
    StructField("Funding_Start_Date",IntegerType(),True),
    StructField("Funding_End_Date",IntegerType(),True),
    StructField("Funding_Level",StringType(),True),
    StructField("Type", StringType(), True),
    StructField("Brand", StringType(), True),
    StructField("Brand_Low", StringType(), True),
    StructField("Family", FloatType(), True),
    StructField("SKU_ID", IntegerType(), True),
    StructField("Allocation_Basis", StringType(), True),
    StructField("Amount", IntegerType(), True)
  ])

funding_df = spark.createDataFrame(data=funding_data,schema=funding_schema)
funding_df.show()

+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+
|Funding_Start_Date|Funding_End_Date|Funding_Level|Type|Brand|Brand_Low|Family|SKU_ID|Allocation_Basis|Amount|
+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+
|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|
+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+

I want to have a row for each day of funding with a per-day Amount depending on the following factor:

a sale has been made on that day at that Funding_Level

I have a sales table at a Date and SKU level.

sales_data = [
    (20210105,352210,"Cars","Audi","A4", 420.0, 1),
    (20210106,352207,"Cars","Audi","A4", 420.0, 5),
    (20210106,352196,"Cars","Audi","A4", 420.0, 2),
    (20210109,352212,"Cars","Audi","A4", 420.0, 3),
    (20210112,352212,"Cars","Audi","A4", 420.0, 1),
    (20210112,352212,"Cars","Audi","A4", 420.0, 2),
    (20210112,352212,"Cars","BMW","X6", 325.0, 2),
    (20210126,352196,"Cars","Audi","A4", 420.0, 1)
  ]

sales_schema = StructType([
    StructField("DATE_ID",IntegerType(),True),
    StructField("SKU_ID",IntegerType(),True),
    StructField("Type",StringType(),True),
    StructField("Brand", StringType(), True),
    StructField("Brand_Low", StringType(), True),
    StructField("Family", FloatType(), True),
    StructField("Quantity", IntegerType(), True)
  ])

sales_df = spark.createDataFrame(data=sales_data,schema=sales_schema)
sales_df.show()

+--------+------+----+-----+---------+------+--------+
| DATE_ID|SKU_ID|Type|Brand|Brand_Low|Family|Quantity|
+--------+------+----+-----+---------+------+--------+
|20210105|352210|Cars| Audi|       A4| 420.0|       1|
|20210106|352207|Cars| Audi|       A4| 420.0|       5|
|20210106|352196|Cars| Audi|       A4| 420.0|       2|
|20210109|352212|Cars| Audi|       A4| 420.0|       3|
|20210112|352212|Cars| Audi|       A4| 420.0|       1|
|20210112|352212|Cars| Audi|       A4| 420.0|       2|
|20210112|352212|Cars|  BMW|       X6| 325.0|       2|
|20210126|352196|Cars| Audi|       A4| 420.0|       1|
+--------+------+----+-----+---------+------+--------+

This would tell me there are 5 unique days on which a product with a Family of 420.0 was sold.

sales_df.filter(col('Family') == 420.0).select('DATE_ID').distinct().show()

+--------+
| DATE_ID|
+--------+
|20210112|
|20210109|
|20210105|
|20210106|
|20210126|
+--------+

So the Lumpsum/Day would be 50000 / 5 = 10000.
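
For reference, that per-day figure can be computed directly from sales_df (a minimal sketch; it hard-codes the example's Family value 420.0 and the 50000 lump sum):

# Count the distinct sale days for this Family, then spread the lump sum over them
days_with_sales = sales_df.filter(col('Family') == 420.0).select('DATE_ID').distinct().count()  # 5
lumpsum_per_day = 50000 / days_with_sales  # 10000.0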

So I'm trying to get a final data frame like this:

+--------+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+-----------+
| DATE_ID|Funding_Start_Date|Funding_End_Date|Funding_Level|Type|Brand|Brand_Low|Family|SKU_ID|Allocation_Basis|Amount|Lumpsum/Day|
+--------+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+-----------+
|20210105|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|      10000|
|20210106|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|      10000|
|20210109|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|      10000|
|20210112|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|      10000|
|20210126|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|      10000|
+--------+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+-----------+

I've tried UDFs, but I wasn't able to pass sales_df into one to count the days and divide the Lump_Sum amount by that count, since UDFs don't accept data frames.

How do I get to this final data frame from the above two data frames?

To find the Lumpsum/Day per Family, Funding_Start_Date and Funding_End_Date:

  1. Convert Funding_Start_Date, Funding_End_Date and DATE_ID to DateType.
  2. Select distinct DATE_ID and Family from sales_df.
  3. Join funding_df and sales_df such that DATE_ID is between Funding_Start_Date and Funding_End_Date and the Family values match.
  4. Apply a count window aggregation over Funding_Start_Date, Funding_End_Date and Family to find the number of days with sales.
  5. Divide Amount by the result of step 4 to arrive at Lumpsum/Day.

from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql import Window

funding_data = [
    (20210101,20211231,"Family","Cars","Audi","A4", 420.0, 12345, "Lump_Sum", 50000)
  ]

funding_schema = StructType([
    StructField("Funding_Start_Date",IntegerType(),True),
    StructField("Funding_End_Date",IntegerType(),True),
    StructField("Funding_Level",StringType(),True),
    StructField("Type", StringType(), True),
    StructField("Brand", StringType(), True),
    StructField("Brand_Low", StringType(), True),
    StructField("Family", FloatType(), True),
    StructField("SKU_ID", IntegerType(), True),
    StructField("Allocation_Basis", StringType(), True),
    StructField("Amount", IntegerType(), True)
  ])

funding_df = spark.createDataFrame(data=funding_data,schema=funding_schema)

# STEP 1
funding_df = (funding_df.withColumn("Funding_Start_Date", F.to_date(F.col("Funding_Start_Date").cast("string"), "yyyyMMdd"))
                        .withColumn("Funding_End_Date", F.to_date(F.col("Funding_End_Date").cast("string"), "yyyyMMdd")))

sales_data = [
    (20210105,352210,"Cars","Audi","A4", 420.0, 1),
    (20210106,352207,"Cars","Audi","A4", 420.0, 5),
    (20210106,352196,"Cars","Audi","A4", 420.0, 2),
    (20210109,352212,"Cars","Audi","A4", 420.0, 3),
    (20210112,352212,"Cars","Audi","A4", 420.0, 1),
    (20210112,352212,"Cars","Audi","A4", 420.0, 2),
    (20210112,352212,"Cars","BMW","X6", 325.0, 2),
    (20210126,352196,"Cars","Audi","A4", 420.0, 1)
  ]

sales_schema = StructType([
    StructField("DATE_ID",IntegerType(),True),
    StructField("SKU_ID",IntegerType(),True),
    StructField("Type",StringType(),True),
    StructField("Brand", StringType(), True),
    StructField("Brand_Low", StringType(), True),
    StructField("Family", FloatType(), True),
    StructField("Quantity", IntegerType(), True)
  ])

sales_df = spark.createDataFrame(data=sales_data,schema=sales_schema)

# STEP 1
sales_df = sales_df.withColumn("DATE_ID", F.to_date(F.col("DATE_ID").cast("string"), "yyyyMMdd"))

# STEP 2
sales_df = sales_df.select("DATE_ID", "Family").distinct()

# STEP 3
joined_df = funding_df.join(sales_df, (sales_df["DATE_ID"].between(funding_df["Funding_Start_Date"], funding_df["Funding_End_Date"]) & (funding_df["Family"] == sales_df["Family"])))
joined_df = joined_df.select(*[funding_df[c] for c in funding_df.columns], "DATE_ID")

# STEP 4 and 5
ws = Window.partitionBy("Funding_Start_Date", "Funding_End_Date", "Family").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

(joined_df.withColumn("Lumpsum/Day", F.col("Amount") / (F.count("DATE_ID").over(ws)))
          .withColumn("Funding_Start_Date", F.date_format("Funding_Start_Date", "yyyyMMdd").cast("int"))
          .withColumn("Funding_End_Date", F.date_format("Funding_End_Date", "yyyyMMdd").cast("int"))
          .withColumn("DATE_ID", F.date_format("DATE_ID", "yyyyMMdd").cast("int"))
).show()

Output

+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+--------+-----------+
|Funding_Start_Date|Funding_End_Date|Funding_Level|Type|Brand|Brand_Low|Family|SKU_ID|Allocation_Basis|Amount| DATE_ID|Lumpsum/Day|
+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+--------+-----------+
|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|20210106|    10000.0|
|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|20210112|    10000.0|
|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|20210126|    10000.0|
|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|20210105|    10000.0|
|          20210101|        20211231|       Family|Cars| Audi|       A4| 420.0| 12345|        Lump_Sum| 50000|20210109|    10000.0|
+------------------+----------------+-------------+----+-----+---------+------+------+----------------+------+--------+-----------+
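
If you also want DATE_ID as the first column, as in the desired output above, you can reorder before showing. A small sketch, assuming the chain above is assigned to a variable named result instead of ending in .show() (result is a placeholder name, not part of the original code):

# Reorder so DATE_ID comes first, keeping all other columns in their existing order
result.select("DATE_ID", *[c for c in result.columns if c != "DATE_ID"]).show()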
