More efficient way to loop through PySpark DataFrame and create new columns
Create Columns in Dataframe Inside Loop With Filters Pyspark
I want to create columns for each element in the list weeks and put them all into a single dataframe. The dataframe df is filtered by weeknum, and then the columns are created. It runs, but the final dataframe only contains the information for the last weeknum. How can I create the columns for all of the weeknums joined together?
I tried this:
from pyspark.sql.functions import first

weeks = [24, 25]
for weeknum in weeks:
    df_new = df.filter(df.week == weeknum).groupBy(['gender', 'pro']).pivot("share").agg(first('forecast_units')) \
        .withColumnRenamed('0.01', 'units_1_share_wk' + str(weeknum)) \
        .withColumnRenamed('0.1', 'units_10_share_wk' + str(weeknum)) \
        .withColumnRenamed('0.15', 'units_15_share_wk' + str(weeknum)) \
        .withColumnRenamed('0.2', 'units_20_share_wk' + str(weeknum))
    df_new.show()
But this only returns the dataframe for the last weeknum in weeks.
The original dataframe df looks like this:
|country|gender|order_date| pro|share| prediction|week|dayofweek|forecast_units|
+-------+------+----------+------------+-------------+------------------+----+---------+-------------------+
| ES| Male|2022-09-15|Jeans - Flat| 0.01|13.322306632995605| 37| 5| 93.0|
| ES| Male|2022-09-15|Jeans - Flat| 0.1| 19.09369468688965| 37| 5| 134.0|
| ES| Male|2022-09-15|Jeans - Flat| 0.15|22.504554748535156| 37| 5| 158.0|
I want the final dataframe to have the following structure:
|gender|pro|units_1_tpr_wk24|units_10_tpr_wk24|units_15_tpr_wk24|units_20_tpr_wk24|units_1_tpr_wk25|units_10_tpr_wk25|units_15_tpr_wk25|units_20_tpr_wk25|
Expected output:
|gender|pro|units_1_tpr_wk24|units_10_tpr_wk24|units_15_tpr_wk24|units_20_tpr_wk24|units_1_tpr_wk25|units_10_tpr_wk25|units_15_tpr_wk25|units_20_tpr_wk25|
|---|---|---|---|---|---|---|---|---|---|
|Female|Belts|28.0|0.0|0.0|0.0|28.0|0.0|0.0|0.0|
|Female|Dress|0.0|44.0|0.0|0.0|0.0|0.0|0.0|0.0|
|Male|Belts|0.0|0.0|33.0|0.0|28.0|0.0|0.0|0.0|
|Male|Suits|0.0|0.0|0.0|34.0|0.0|0.0|0.0|0.0|
I would suggest generating all of the required columns first and then passing them into a single select, like below. Note that filtering and selecting outside the loop only sees the last weeknum, so the pivot has to include the week as well; pivoting on share and week together produces every week's values as columns in one dataframe:

from pyspark.sql.functions import col, concat_ws, first

weeks = [24, 25]
shares = {'0.01': 'units_1', '0.1': 'units_10', '0.15': 'units_15', '0.2': 'units_20'}

# Build the full list of output columns up front.
cols_to_select = [col('gender'), col('pro')]
for weeknum in weeks:
    for share, prefix in shares.items():
        cols_to_select.append(col(f'{share}_{weeknum}').alias(f'{prefix}_share_wk{weeknum}'))

# Pivot on share and week combined so every week's values become
# columns in one dataframe, then select them all at once.
result = (df.filter(df.week.isin(weeks))
          .withColumn('share_week', concat_ws('_', col('share'), col('week')))
          .groupBy('gender', 'pro')
          .pivot('share_week')
          .agg(first('forecast_units'))
          .select(cols_to_select))
result.show()
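The wide layout the question asks for can be sanity-checked without a running Spark session by sketching the same reshaping in pandas, pivoting on share and week together. The rows below are invented stand-ins for the question's data; only the column names follow the post:

```python
import pandas as pd

# Toy stand-in for the question's dataframe; the values are made up.
df = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male'],
    'pro':    ['Belts',  'Belts',  'Suits', 'Suits'],
    'share':  [0.01,     0.01,     0.2,     0.2],
    'week':   [24,       25,       24,      25],
    'forecast_units': [28.0, 28.0, 34.0, 33.0],
})

# Pivot on the (share, week) pair at once so each week's values
# become their own columns in a single wide dataframe.
wide = df.pivot_table(index=['gender', 'pro'],
                      columns=['share', 'week'],
                      values='forecast_units', aggfunc='first')
# Flatten the MultiIndex columns into the question's naming scheme.
wide.columns = [f'units_{int(round(s * 100))}_share_wk{w}'
                for s, w in wide.columns]
wide = wide.reset_index()
print(wide)
```

Each (share, week) pair present in the data yields one column, e.g. units_1_share_wk24 and units_1_share_wk25 side by side, which matches the expected output shape above.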