How to count distinct based on a condition over a window aggregation in PySpark?
This is a sample of the data I have as a DataFrame:
from pyspark.sql.functions import *
from pyspark.sql.types import StringType, IntegerType, DateType, StructType, StructField
from datetime import datetime
from pyspark.sql import Window
data2 = [
    (datetime.strptime("2020/12/29", "%Y/%m/%d"), "Store B", "Product 1", 0),
    (datetime.strptime("2020/12/29", "%Y/%m/%d"), "Store B", "Product 2", 1),
    (datetime.strptime("2020/12/31", "%Y/%m/%d"), "Store A", "Product 2", 1),
    (datetime.strptime("2020/12/31", "%Y/%m/%d"), "Store A", "Product 3", 1),
    (datetime.strptime("2021/01/01", "%Y/%m/%d"), "Store A", "Product 1", 1),
    (datetime.strptime("2021/01/01", "%Y/%m/%d"), "Store A", "Product 2", 3),
    (datetime.strptime("2021/01/01", "%Y/%m/%d"), "Store A", "Product 3", 2),
    (datetime.strptime("2021/01/01", "%Y/%m/%d"), "Store B", "Product 1", 10),
    (datetime.strptime("2021/01/01", "%Y/%m/%d"), "Store B", "Product 2", 15),
    (datetime.strptime("2021/01/01", "%Y/%m/%d"), "Store B", "Product 3", 9),
    (datetime.strptime("2021/01/02", "%Y/%m/%d"), "Store A", "Product 1", 0),
    (datetime.strptime("2021/01/03", "%Y/%m/%d"), "Store A", "Product 2", 2)
]
schema = StructType([
    StructField("date", DateType(), True),
    StructField("store", StringType(), True),
    StructField("product", StringType(), True),
    StructField("stock_c", IntegerType(), True)
])
df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
root
|-- date: date (nullable = true)
|-- store: string (nullable = true)
|-- product: string (nullable = true)
|-- stock_c: integer (nullable = true)
+----------+-------+---------+-------+
|date |store |product |stock_c|
+----------+-------+---------+-------+
|2020-12-29|Store B|Product 1|0 |
|2020-12-29|Store B|Product 2|1 |
|2020-12-31|Store A|Product 2|1 |
|2020-12-31|Store A|Product 3|1 |
|2021-01-01|Store A|Product 1|1 |
|2021-01-01|Store A|Product 2|3 |
|2021-01-01|Store A|Product 3|2 |
|2021-01-01|Store B|Product 1|10 |
|2021-01-01|Store B|Product 2|15 |
|2021-01-01|Store B|Product 3|9 |
|2021-01-02|Store A|Product 1|0 |
|2021-01-03|Store A|Product 2|2 |
+----------+-------+---------+-------+
The column stock_c represents the cumulative stock of a product in a store.
I want to create two new columns. One tells me how many products the store has or has had; that one is easy. The other column I need is the number of products that have stock in that store on that day, and this is the part I cannot solve.
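To pin down the logic I am after, here is a plain-Python sketch (no Spark; with_stock_on is just an illustrative helper, not what I intend to ship): on a given day, a product counts as "with stock" when its most recently reported cumulative stock_c is greater than zero.

```python
rows = [  # (date, store, product, stock_c) -- the Store A sample rows
    ("2020-12-31", "Store A", "Product 2", 1),
    ("2020-12-31", "Store A", "Product 3", 1),
    ("2021-01-01", "Store A", "Product 1", 1),
    ("2021-01-01", "Store A", "Product 2", 3),
    ("2021-01-01", "Store A", "Product 3", 2),
    ("2021-01-02", "Store A", "Product 1", 0),
    ("2021-01-03", "Store A", "Product 2", 2),
]

def with_stock_on(rows, store, day):
    latest = {}  # product -> last reported stock_c up to `day`
    for d, s, p, c in sorted(rows):  # ISO date strings sort chronologically
        if s == store and d <= day:
            latest[p] = c
    return sum(1 for c in latest.values() if c > 0)

print(with_stock_on(rows, "Store A", "2021-01-02"))  # 2
print(with_stock_on(rows, "Store A", "2021-01-03"))  # 2
```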
This is the code I am using:
windowStore = Window.partitionBy("store").orderBy("date")
df \
.withColumn("num_products", approx_count_distinct("product").over(windowStore)) \
.withColumn("num_products_with_stock", approx_count_distinct(when(col("stock_c") > 0, col("product"))).over(windowStore)) \
.show()
This is what I get:
+----------+-------+---------+-------+------------+-----------------------+
| date| store| product|stock_c|num_products|num_products_with_stock|
+----------+-------+---------+-------+------------+-----------------------+
|2020-12-31|Store A|Product 2| 1| 2| 2|
|2020-12-31|Store A|Product 3| 1| 2| 2|
|2021-01-01|Store A|Product 1| 1| 3| 3|
|2021-01-01|Store A|Product 2| 3| 3| 3|
|2021-01-01|Store A|Product 3| 2| 3| 3|
|2021-01-02|Store A|Product 1| 0| 3| 3|
|2021-01-03|Store A|Product 2| 2| 3| 3|
|2020-12-29|Store B|Product 1| 0| 2| 1|
|2020-12-29|Store B|Product 2| 1| 2| 1|
|2021-01-01|Store B|Product 1| 10| 3| 3|
|2021-01-01|Store B|Product 2| 15| 3| 3|
|2021-01-01|Store B|Product 3| 9| 3| 3|
+----------+-------+---------+-------+------------+-----------------------+
And this is what I want to get:
+----------+-------+---------+-------+------------+-----------------------+
| date| store| product|stock_c|num_products|num_products_with_stock|
+----------+-------+---------+-------+------------+-----------------------+
|2020-12-31|Store A|Product 2| 1| 2| 2|
|2020-12-31|Store A|Product 3| 1| 2| 2|
|2021-01-01|Store A|Product 1| 1| 3| 3|
|2021-01-01|Store A|Product 2| 3| 3| 3|
|2021-01-01|Store A|Product 3| 2| 3| 3|
|2021-01-02|Store A|Product 1| 0| 3| 2|
|2021-01-03|Store A|Product 2| 2| 3| 2|
|2020-12-29|Store B|Product 1| 0| 2| 1|
|2020-12-29|Store B|Product 2| 1| 2| 1|
|2021-01-01|Store B|Product 1| 10| 3| 3|
|2021-01-01|Store B|Product 2| 15| 3| 3|
|2021-01-01|Store B|Product 3| 9| 3| 3|
+----------+-------+---------+-------+------------+-----------------------+
The key is in these two rows: since Product 1 no longer has any stock, they should reflect that only 2 products have stock (Product 2 and Product 3).
|2021-01-02|Store A|Product 1| 0| 3| 2|
|2021-01-03|Store A|Product 2| 2| 3| 2|
How can I achieve what I want?
Thanks in advance.
Below you can find the code I used to solve the num_products_with_stock column. Basically, I create a new conditional column that replaces product with None whenever stock_c is 0. In the end, I use code very close to yours, but apply F.approx_count_distinct to this new column instead.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
window1 = W.partitionBy("store").orderBy("date")
window2 = W.partitionBy(["store", "date"]).orderBy("date")
df = (df
      .withColumn("num_products", F.approx_count_distinct("product").over(window1))
      .withColumn("hasItem", F.when(F.col("stock_c") > 0, F.col("product")).otherwise(None))
      .withColumn("num_products_with_stock", F.approx_count_distinct(F.col("hasItem")).over(window2))
      .drop("hasItem")
)
df.show()
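The effect of the hasItem trick can be checked without Spark: replacing the product with None before the distinct count is the same as counting distinct products with positive stock within each (store, date) group. A minimal plain-Python sketch with the Store B sample rows (names are illustrative only):

```python
from collections import defaultdict

rows = [  # (date, store, product, stock_c) -- the Store B sample rows
    ("2020-12-29", "Store B", "Product 1", 0),
    ("2020-12-29", "Store B", "Product 2", 1),
    ("2021-01-01", "Store B", "Product 1", 10),
    ("2021-01-01", "Store B", "Product 2", 15),
    ("2021-01-01", "Store B", "Product 3", 9),
]

groups = defaultdict(set)
for d, s, p, c in rows:
    if c > 0:                  # when(stock_c > 0, product).otherwise(None)
        groups[(s, d)].add(p)  # "None" rows simply never enter the set

counts = {k: len(v) for k, v in groups.items()}
print(counts[("Store B", "2020-12-29")])  # 1
print(counts[("Store B", "2021-01-01")])  # 3
```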
Hope this solves your problem!
I finally solved this with the help of @danimille.
First, I filled in the missing dates, and then I counted the number of products with stock using a helper column called has_stock:
from datetime import timedelta
from pyspark.sql.types import ArrayType, TimestampType

def dates_between(t1, t2):
    return [t1 + timedelta(days=x) for x in range(0, int((t2 - t1).days) + 1)]

dates_between_udf = udf(dates_between, ArrayType(TimestampType()))

date_filler = (
    df
    .withColumn('date', to_timestamp(to_date('date')))  # quick-and-dirty hack to match the UDF's TimestampType
    .withColumn("max_date", max("date").over(Window.partitionBy("store")))
    .withColumn("min_date", min("date").over(Window.partitionBy("store")))
    .withColumn("products", collect_set("product").over(Window.partitionBy("store")))
    .withColumn("dates", dates_between_udf(col("min_date"), col("max_date")))
    .select("store", "products", "dates")
    .distinct()
    .withColumn("product", explode("products"))
    .withColumn("date", explode("dates"))
    .drop("products", "dates")
)
(
    df
    .join(date_filler, on=["store", "product", "date"], how="full")
    .withColumn(
        "stock_c",
        last("stock_c", ignorenulls=True).over(Window.partitionBy("store", "product").orderBy(col("date")))
    )
    .na.fill(0, "stock_c")
    .withColumn("num_products", approx_count_distinct("product").over(windowStore))
    .withColumn("has_stock", when(col("stock_c") > 0, 1).otherwise(0))
    .withColumn("num_products_with_stock", sum("has_stock").over(Window.partitionBy("store", "date")))
    .show()
)
The result is as follows:
+-------+---------+-------------------+-------+------------+-----------------------+---------+
| store| product| date|stock_c|num_products|num_products_with_stock|has_stock|
+-------+---------+-------------------+-------+------------+-----------------------+---------+
|Store A|Product 1|2020-12-31 00:00:00| 0| 3| 2| 0|
|Store A|Product 2|2020-12-31 00:00:00| 1| 3| 2| 1|
|Store A|Product 3|2020-12-31 00:00:00| 1| 3| 2| 1|
|Store A|Product 1|2021-01-01 00:00:00| 1| 3| 3| 1|
|Store A|Product 2|2021-01-01 00:00:00| 3| 3| 3| 1|
|Store A|Product 3|2021-01-01 00:00:00| 2| 3| 3| 1|
|Store A|Product 1|2021-01-02 00:00:00| 0| 3| 2| 0|
|Store A|Product 2|2021-01-02 00:00:00| 3| 3| 2| 1|
|Store A|Product 3|2021-01-02 00:00:00| 2| 3| 2| 1|
|Store A|Product 1|2021-01-03 00:00:00| 0| 3| 2| 0|
|Store A|Product 2|2021-01-03 00:00:00| 2| 3| 2| 1|
|Store A|Product 3|2021-01-03 00:00:00| 2| 3| 2| 1|
|Store B|Product 1|2020-12-29 00:00:00| 0| 3| 1| 0|
|Store B|Product 2|2020-12-29 00:00:00| 1| 3| 1| 1|
|Store B|Product 3|2020-12-29 00:00:00| 0| 3| 1| 0|
|Store B|Product 1|2020-12-30 00:00:00| 0| 3| 1| 0|
|Store B|Product 2|2020-12-30 00:00:00| 1| 3| 1| 1|
|Store B|Product 3|2020-12-30 00:00:00| 0| 3| 1| 0|
|Store B|Product 1|2020-12-31 00:00:00| 0| 3| 1| 0|
|Store B|Product 2|2020-12-31 00:00:00| 1| 3| 1| 1|
+-------+---------+-------------------+-------+------------+-----------------------+---------+
only showing top 20 rows
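The two building blocks of this solution, gap-day generation and forward filling the stock, can be sanity-checked in plain Python (using datetime.date instead of Spark timestamps; the dict-based fill is illustrative only):

```python
from datetime import date, timedelta

def dates_between(t1, t2):
    # same logic as the UDF above
    return [t1 + timedelta(days=x) for x in range(0, int((t2 - t1).days) + 1)]

# Store B reports on 2020-12-29 and again on 2021-01-01,
# so the filler has to create the two gap days in between.
days = dates_between(date(2020, 12, 29), date(2021, 1, 1))
print(len(days))  # 4

# Forward fill: carry the last known stock_c onto the filled days,
# mirroring last("stock_c", ignorenulls=True) over the ordered window.
reported = {date(2020, 12, 29): 1}  # Product 2's only report in range
filled, last_seen = {}, None
for d in days:
    last_seen = reported.get(d, last_seen)
    filled[d] = last_seen
print(filled[date(2020, 12, 31)])  # 1
```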