
Create Columns in Dataframe Inside Loop With Filters Pyspark

I want to create columns for each element of the list "weeks" and have them all in a single dataframe. The dataframe "df" is filtered by "weeknum" and the columns are then created. It runs, but the resulting dataframe only contains information for the last "weeknum". How can I create the columns for all "weeknum"s joined together?

I have tried this:

from pyspark.sql.functions import first

weeks = [24, 25]
for weeknum in weeks:
    df_new = df.filter(df.week == weeknum).groupBy(['gender', 'pro']).pivot("share").agg(first('forecast_units')) \
        .withColumnRenamed('0.01', 'units_1_share_wk'+str(weeknum))\
        .withColumnRenamed('0.1', 'units_10_share_wk'+str(weeknum))\
        .withColumnRenamed('0.15', 'units_15_share_wk'+str(weeknum))\
        .withColumnRenamed('0.2', 'units_20_share_wk'+str(weeknum)) 
df_new.show()

But this only returns the dataframe for the last "weeknum" in "weeks".

The original dataframe "df" looks like this:


+-------+------+----------+------------+-----+------------------+----+---------+--------------+
|country|gender|order_date|         pro|share|        prediction|week|dayofweek|forecast_units|
+-------+------+----------+------------+-----+------------------+----+---------+--------------+
|     ES|  Male|2022-09-15|Jeans - Flat| 0.01|13.322306632995605|  37|        5|          93.0|
|     ES|  Male|2022-09-15|Jeans - Flat|  0.1| 19.09369468688965|  37|        5|         134.0|
|     ES|  Male|2022-09-15|Jeans - Flat| 0.15|22.504554748535156|  37|        5|         158.0|
+-------+------+----------+------------+-----+------------------+----+---------+--------------+

I want the final dataframe to have the following structure:

|gender|pro|units_1_tpr_wk24|units_10_tpr_wk24|units_15_tpr_wk24|units_20_tpr_wk24|units_1_tpr_wk25|units_10_tpr_wk25|units_15_tpr_wk25|units_20_tpr_wk25|

Expected output:

|gender|pro|units_1_tpr_wk24|units_10_tpr_wk24|units_15_tpr_wk24|units_20_tpr_wk24|units_1_tpr_wk25|units_10_tpr_wk25|units_15_tpr_wk25|units_20_tpr_wk25|
|---|---|---|---|---|---|---|---|---|---|
|Female|Belts|28.0|0.0|0.0|0.0|28.0|0.0|0.0|0.0|
|Female|Dress|0.0|44.0|0.0|0.0|0.0|0.0|0.0|0.0|
|Male|Belts|0.0|0.0|33.0|0.0|28.0|0.0|0.0|0.0|
|Male|Suits|0.0|0.0|0.0|34.0|0.0|0.0|0.0|0.0|

I would suggest first generating all the required columns and then passing them into the select function, as shown below:

from pyspark.sql.functions import col, first

weeks = [24, 25]
cols_to_select = []
for weeknum in weeks:
    cols_to_select.extend([
        col('0.01').alias(f'units_1_share_wk{weeknum}'),
        col('0.1').alias(f'units_10_share_wk{weeknum}'),
        col('0.15').alias(f'units_15_share_wk{weeknum}'),
        col('0.2').alias(f'units_20_share_wk{weeknum}')
    ])

df.filter(df.week == weeknum) \
    .groupBy(['gender', 'pro']) \
    .pivot("share") \
    .agg(first('forecast_units')) \
    .select([col("gender"), col("pro")] + cols_to_select)
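Note that, as written, the filter in the last line still uses only the last value of weeknum left over from the loop, so the pivot is built from a single week's rows. A minimal sketch of one way to get every week's columns into one dataframe, assuming that "gender" and "pro" uniquely identify a row within each week, is to keep one pivoted frame per week and outer-join them (the names weekly_frames and df_all below are illustrative, not from the original post):

from functools import reduce
from pyspark.sql.functions import first

weeks = [24, 25]
weekly_frames = []
for weeknum in weeks:
    # one pivoted frame per week, with the share columns renamed to carry the week number
    wk = df.filter(df.week == weeknum).groupBy(['gender', 'pro']).pivot("share").agg(first('forecast_units')) \
        .withColumnRenamed('0.01', 'units_1_share_wk' + str(weeknum)) \
        .withColumnRenamed('0.1', 'units_10_share_wk' + str(weeknum)) \
        .withColumnRenamed('0.15', 'units_15_share_wk' + str(weeknum)) \
        .withColumnRenamed('0.2', 'units_20_share_wk' + str(weeknum))
    weekly_frames.append(wk)

# outer join keeps a gender/pro pair even if it only appears in one of the weeks
df_all = reduce(lambda left, right: left.join(right, on=['gender', 'pro'], how='outer'), weekly_frames)
df_all.show()

Each per-week frame then contributes its own wk24/wk25 columns, and the outer join lines the rows up on gender and pro, matching the expected output structure above.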
