I have been trying to figure out the best way to join n Spark DataFrames.
Example: List(df1, df2, df3, ..., dfN)
where every DataFrame has a date column I can join on.
Recursion?
Like this:
List(df1,df2,df3,dfN).reduce((a, b) => a.join(b, joinCondition))
Here is the same answer as above for PySpark users:

from functools import reduce
from pyspark.sql.functions import coalesce

# dfslist: list of all DataFrames you want to join
mergedDf = reduce(
    lambda df1, df2: df1.join(df2, [df1.joinKey == df2.joinKey], "outer")
                        .select("*", coalesce(df1.joinKey, df2.joinKey).alias("joinKey"))
                        .drop(df1.joinKey)
                        .drop(df2.joinKey),
    dfslist
)

The outer join keeps rows whose key appears in either side, and coalesce rebuilds a single joinKey column from whichever side was non-null before the two originals are dropped.
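For readers without a Spark session at hand, the fold pattern itself can be sketched in plain Python: functools.reduce merges a list of "tables" pairwise, just as the answers above fold a list of DataFrames with pairwise outer joins. The dict-based outer_merge helper and the sample data below are illustrations only, not Spark code.

```python
from functools import reduce

# Each "table" maps a join key (a date) to that table's value,
# standing in for one DataFrame keyed by its date column.
tables = [
    {"2024-01-01": 1, "2024-01-02": 2},
    {"2024-01-01": 10, "2024-01-03": 30},
    {"2024-01-02": 200, "2024-01-03": 300},
]

def outer_merge(left, right):
    """Merge two tables on their keys, keeping keys present in either
    side (the analogue of an outer join); values accumulate in lists."""
    merged = {}
    for key in left.keys() | right.keys():
        merged[key] = left.get(key, []) + right.get(key, [])
    return merged

# Wrap each value in a list so successive merges concatenate row fragments.
prepared = [{k: [v] for k, v in t.items()} for t in tables]

# The fold: exactly the shape of reduce((a, b) => a.join(b, ...)) above.
result = reduce(outer_merge, prepared)
# result → {"2024-01-01": [1, 10], "2024-01-02": [2, 200], "2024-01-03": [30, 300]}
```

Like the Spark version, the merge is associative here, so reduce simply chains it across the whole list regardless of length.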