How to perform a multi-row multi-column operation in parallel within PySpark, with minimum loops?
I want to perform a multi-row, multi-column operation in PySpark with few or no loops. The Spark DataFrame 'df' has the following data:
city time temp humid
NewYork 1500 67 57
NewYork 1600 69 55
NewYork 1700 70 56
Dallas 1500 47 37
Dallas 1600 49 35
Dallas 1700 50 39
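For reference, a minimal snippet to reproduce this DataFrame (assuming an active SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('NewYork', 1500, 67, 57), ('NewYork', 1600, 69, 55), ('NewYork', 1700, 70, 56),
     ('Dallas', 1500, 47, 37), ('Dallas', 1600, 49, 35), ('Dallas', 1700, 50, 39)],
    ['city', 'time', 'temp', 'humid'])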
I used for loops, but they come at the cost of parallelism and are not efficient:
from pyspark.sql.functions import col, lit

city_list = [i.city for i in df.select('city').distinct().collect()]
metric_cols = ['temp', 'humid']

for city in city_list:
    for metric in metric_cols:
        tempDF = df.filter(col('city') == city)
        metric_values = [i[metric] for i in tempDF.select(metric).collect()]
        time_values = [i['time'] for i in tempDF.select('time').collect()]
        tuples = list(zip(time_values, metric_values))
        newColName = city + metric
        df = df.withColumn(newColName, lit(tuples))  # lit() with a list of tuples is likely the failing step
I also don't think this works.
I would like the output to be:
city time temp humid timetemp timehumidity
NewYork 1500 67 57 [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
NewYork 1600 69 55 [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
NewYork 1700 70 56 [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
Dallas 1500 47 37 [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
Dallas 1600 49 35 [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
Dallas 1700 50 39 [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
or at least:
city timetemp timehumidity
NewYork [(1500,67),(1600,69),(1700,70)] [(1500,57),(1600,55),(1700,56)]
Dallas [(1500,47),(1600,49),(1700,50)] [(1500,37),(1600,35),(1700,39)]
One option is to use the struct function:
import pyspark.sql.functions as F

df.groupby('city').agg(
    F.collect_list(F.struct(F.col('time'), F.col('temp'))).alias('timetemp'),
    F.collect_list(F.struct(F.col('time'), F.col('humid'))).alias('timehumidity')
).show(2, False)
Output:
+-------+------------------------------------+------------------------------------+
|city |timetemp |timehumidity |
+-------+------------------------------------+------------------------------------+
|Dallas |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|NewYork|[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
+-------+------------------------------------+------------------------------------+
You can join it back to the original DataFrame; a sketch follows.
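A minimal sketch, using F imported above and storing the aggregate in agg_df (an illustrative name):

agg_df = df.groupby('city').agg(
    F.collect_list(F.struct(F.col('time'), F.col('temp'))).alias('timetemp'),
    F.collect_list(F.struct(F.col('time'), F.col('humid'))).alias('timehumidity'))

df_joined = df.join(agg_df, on='city', how='left')

This yields the first desired output: every original row plus the per-city list columns.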
If you want the results as actual tuples, you may need to write your own udf.
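A minimal sketch of such a udf, assuming a string rendering is acceptable (Spark has no native tuple type, and each struct arrives inside the udf as a Row, which converts cleanly with tuple()):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Render an array of structs as a Python-style list of tuples, serialized to a string.
to_tuples = udf(lambda pairs: str([tuple(p) for p in pairs]), StringType())

df_joined = df_joined.withColumn('timetemp', to_tuples('timetemp'))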
You can also define lists of columns and handle more sets of columns:
list_1 = ['time']
list_2 = ['temp', 'humid']  # change these accordingly

df_array = [df.groupby('city').agg(F.collect_list(F.struct(F.col(x), F.col(y))).alias(x + y))
            for x in list_1 for y in list_2]

for df_temp in df_array:
    df = df.join(df_temp, on='city', how='left')

df.show(truncate=False)
Output:
+-------+----+----+-----+------------------------------------+------------------------------------+
|city |time|temp|humid|timetemp |timehumid |
+-------+----+----+-----+------------------------------------+------------------------------------+
|Dallas |1500|47 |37 |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|Dallas |1600|49 |35 |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|Dallas |1700|50 |39 |[[1500, 47], [1600, 49], [1700, 50]]|[[1500, 37], [1600, 35], [1700, 39]]|
|NewYork|1500|67 |57 |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
|NewYork|1600|69 |55 |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
|NewYork|1700|70 |56 |[[1500, 67], [1600, 69], [1700, 70]]|[[1500, 57], [1600, 55], [1700, 56]]|
+-------+----+----+-----+------------------------------------+------------------------------------+
I found a more performant solution in PySpark:
from pyspark.sql.functions import arrays_zip, collect_list
from pyspark.sql.window import Window

def create_tuples(df):
    mycols = ['temp', 'humid']
    lcols = mycols.copy()  # copy so appending 'time' does not mutate mycols
    lcols.append('time')
    # Collect each column into a per-city list via a window, avoiding joins.
    for lcol in lcols:
        df = df.select('*', collect_list(lcol).over(Window.partitionBy('city')).alias(lcol + '_list'))
    # Zip the time list with each metric list into an array of structs.
    for mycol in mycols:
        df = df.withColumn(mycol + '_tuple', arrays_zip('time_list', mycol + '_list'))
    return df

tuples_df = create_tuples(df)
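Note that arrays_zip also produces arrays of structs rather than Python tuples, so a udf like the one sketched above would still be needed for true tuple output; the gain here is that the window functions avoid the per-city joins entirely.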