
How to join dataframes (from a collection of Datasets)?

I am trying to figure out the best way to join n Spark dataframes.

Example: List(df1,df2,df3,dfN), where every dataframe has a date column that I can join on.

Would recursion work?

Like this:

List(df1,df2,df3,dfN).reduce((a, b) => a.join(b, joinCondition))

Here is the same answer as above for PySpark users:

from functools import reduce
from pyspark.sql.functions import coalesce
dfslist = [df1, df2, df3, dfN]  # list of all dataframes that you want to join
# Full outer join on joinKey, then collapse the two key columns into one.
mergedDf = reduce(
    lambda a, b: a.join(b, [a.joinKey == b.joinKey], "outer")
    .select("*", coalesce(a.joinKey, b.joinKey).alias("joinKey"))
    .drop(a.joinKey).drop(b.joinKey), dfslist)
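If the key column has the same name in every dataframe, a shorter variant might be to pass the column name itself to join instead of an equality expression. With a name-based join Spark keeps a single key column in the result, so the manual coalesce and drop steps should not be needed. The sketch below is only an illustration under that assumption: join_all is a hypothetical helper name, and "date" is taken from the question as the key.

```python
from functools import reduce

def join_all(dfs, key="date", how="outer"):
    """Fold a list of DataFrames into one by joining them pairwise on `key`.

    Assumes every DataFrame has a column named `key` and that the
    remaining column names do not clash between DataFrames.
    """
    if not dfs:
        raise ValueError("need at least one DataFrame to join")
    # Joining on a column *name* (not a Column expression) yields one
    # `key` column in the output rather than one per input DataFrame.
    return reduce(lambda left, right: left.join(right, on=key, how=how), dfs)
```

Usage would look like mergedDf = join_all([df1, df2, df3, dfN], key="date"). If the non-key columns share names across dataframes, they would need to be renamed first, otherwise later references to them become ambiguous.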
