How to perform a multi-row multi-column operation in parallel within PySpark, with minimum loops?
I want to perform a multi-row, multi-column operation in PySpark with few or no loops. The Spark DataFrame 'df' has the following data:
city time temp humid
NewYork 1500 67 57
NewYork 1600 69 55
NewYork 1700 70 56
Dallas 1500 47 37
Dallas 1600 49 35
Dallas 1700 50 39
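For reference, a minimal snippet to reproduce this DataFrame (assuming an active SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('NewYork', 1500, 67, 57), ('NewYork', 1600, 69, 55), ('NewYork', 1700, 70, 56),
     ('Dallas', 1500, 47, 37), ('Dallas', 1600, 49, 35), ('Dallas', 1700, 50, 39)],
    ['city', 'time', 'temp', 'humid'])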
I used for loops, but they come at the cost of parallelism and are not efficient:
from pyspark.sql.functions import col, lit

city_list = [i.city for i in df.select('city').distinct().collect()]
metric_cols = ['temp', 'humid']

for city in city_list:
    for metric in metric_cols:
        tempDF = df.filter(col('city') == city)
        metric_values = [i[metric] for i in tempDF.select(metric).collect()]
        time_values = [i['time'] for i in tempDF.select('time').collect()]
        tuples = list(zip(time_values, metric_values))
        newColName = city + metric
        df = df.withColumn(newColName, lit(tuples))  # lit() with a list of tuples is likely the failing step
I also don't think this works.
I would like the output to be:
city time temp humid timetemp timehumidity
NewYork 1500 67 57 [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
NewYork 1600 69 55 [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
NewYork 1700 70 56 [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
Dallas 1500 47 37 [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
Dallas 1600 49 35 [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
Dallas 1700 50 39 [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
or at least:
city timetemp timehumidity
NewYork [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
Dallas [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
One option is to use the struct function:
import pyspark.sql.functions as F

df.groupby('city').agg(
    F.collect_list(F.struct(F.col('time'), F.col('temp'))).alias('timetemp'),
    F.collect_list(F.struct(F.col('time'), F.col('humid'))).alias('timehumidity')
).show(2, False)
Output:
+-------+------------------------------------+------------------------------------+
|city |timetemp |timehumidity |
+-------+------------------------------------+------------------------------------+
|Dallas |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|NewYork|[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
+-------+------------------------------------+------------------------------------+
You can join it back to the original DataFrame; a sketch follows.
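A minimal sketch, using F imported above and storing the aggregate in agg_df (an illustrative name):

agg_df = df.groupby('city').agg(
    F.collect_list(F.struct(F.col('time'), F.col('temp'))).alias('timetemp'),
    F.collect_list(F.struct(F.col('time'), F.col('humid'))).alias('timehumidity'))

df_joined = df.join(agg_df, on='city', how='left')

This yields the first desired output: every original row plus the per-city list columns.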
If you want the results as actual tuples, you may need to write your own udf.
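A minimal sketch of such a udf, assuming a string rendering is acceptable (Spark has no native tuple type, and each struct arrives inside the udf as a Row, which converts cleanly with tuple()):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Render an array of structs as a Python-style list of tuples, serialized to a string.
to_tuples = udf(lambda pairs: str([tuple(p) for p in pairs]), StringType())

df_joined = df_joined.withColumn('timetemp', to_tuples('timetemp'))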
You can also define lists of columns and handle more sets of columns:
list_1 = ['time']
list_2 = ['temp', 'humid']  # change these accordingly

df_array = [df.groupby('city').agg(F.collect_list(F.struct(F.col(x), F.col(y))).alias(x + y))
            for x in list_1 for y in list_2]

for df_temp in df_array:
    df = df.join(df_temp, on='city', how='left')

df.show(truncate=False)
Output:
+-------+----+----+-----+------------------------------------+------------------------------------+
|city |time|temp|humid|timetemp |timehumid |
+-------+----+----+-----+------------------------------------+------------------------------------+
|Dallas |1500|47 |37 |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|Dallas |1600|49 |35 |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|Dallas |1700|50 |39 |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|NewYork|1500|67 |57 |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
|NewYork|1600|69 |55 |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
|NewYork|1700|70 |56 |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
+-------+----+----+-----+------------------------------------+------------------------------------+
I found a more performant solution in PySpark:
from pyspark.sql.functions import arrays_zip, collect_list
from pyspark.sql.window import Window

def create_tuples(df):
    mycols = ['temp', 'humid']
    lcols = mycols.copy()  # copy so appending 'time' does not mutate mycols
    lcols.append('time')
    # Collect each column into a per-city list via a window, avoiding joins.
    for lcol in lcols:
        df = df.select('*', collect_list(lcol).over(Window.partitionBy('city')).alias(lcol + '_list'))
    # Zip the time list with each metric list into an array of structs.
    for mycol in mycols:
        df = df.withColumn(mycol + '_tuple', arrays_zip('time_list', mycol + '_list'))
    return df

tuples_df = create_tuples(df)
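Note that arrays_zip also produces arrays of structs rather than Python tuples, so a udf like the one sketched above would still be needed for true tuple output; the gain here is that the window functions avoid the per-city joins entirely.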