
Join multiple dataframes in scala

I have two variables. One is a DataFrame and the other is a List[DataFrame]. I want to join them. At the moment I am using the following approach:

def joinDfList(SingleDataFrame: DataFrame, DataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = {

    var joinedDf = SingleDataFrame
    DataFrameList.foreach(
      Df => {
        joinedDf = joinedDf.join(Df, groupByCols, "left_outer")
      }
    )
    joinedDf.na.fill(0.0)
}

Is there an approach that avoids the "var" and uses "foldLeft" instead of "foreach"?

You can simply write it without vars using foldLeft:

def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = 
  dataFrameList.foldLeft(singleDataFrame)(
    (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
  ).na.fill(0.0)

In this code, dfAcc is the accumulated result: at each step it is joined with the next DataFrame from dataFrameList, and at the end you get a single DataFrame.
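To see why foldLeft is a drop-in replacement for the var/foreach loop, here is a minimal sketch that needs no Spark at all. The `Table` type, `leftJoin`, and both `joinWith...` functions are hypothetical stand-ins invented for illustration (a nested Map plays the role of a DataFrame); only the shape of the fold matches the answer above.

```scala
object FoldLeftSketch {
  // Hypothetical stand-in for a DataFrame: key -> (column name -> value)
  type Table = Map[String, Map[String, Double]]

  // Stand-in for a left outer join on the key: every key of `left` is kept,
  // and the columns of `right` are added when `right` has the same key.
  def leftJoin(left: Table, right: Table): Table =
    left.map { case (key, cols) =>
      key -> (cols ++ right.getOrElse(key, Map.empty[String, Double]))
    }

  // Mutable version, mirroring the question's var + foreach
  def joinWithVar(single: Table, others: List[Table]): Table = {
    var acc = single
    others.foreach(t => acc = leftJoin(acc, t))
    acc
  }

  // Immutable version: foldLeft threads the accumulator through the list
  def joinWithFold(single: Table, others: List[Table]): Table =
    others.foldLeft(single)((acc, next) => leftJoin(acc, next))

  def main(args: Array[String]): Unit = {
    val base: Table = Map("a" -> Map("v0" -> 1.0), "b" -> Map("v0" -> 2.0))
    val others: List[Table] = List(
      Map("a" -> Map("v1" -> 10.0)),
      Map("b" -> Map("v2" -> 20.0), "c" -> Map("v2" -> 30.0))
    )
    // Both versions apply the same joins in the same order
    println(joinWithVar(base, others) == joinWithFold(base, others))
  }
}
```

The seed value passed to foldLeft plays the role of the initial `var`, and the value returned by the function at each step plays the role of the reassignment.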

Important: be careful, chaining too many joins in one job can degrade performance.
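One way to get a feel for how the plan grows is to print it with explain(). The sketch below is only an illustration under assumptions: it assumes Spark SQL is on the classpath and uses a local SparkSession, and the object name, the column names (key, v0, v1, v2), and the sample rows are all made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinPlanSketch {
  // Same function as in the answer above
  def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame =
    dataFrameList.foldLeft(singleDataFrame)(
      (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
    ).na.fill(0.0)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-plan-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val base = Seq(("a", 1.0), ("b", 2.0)).toDF("key", "v0")
    val others = List(
      Seq(("a", 10.0)).toDF("key", "v1"),
      Seq(("b", 20.0)).toDF("key", "v2")
    )

    val joined = joinDfList(base, others, List("key"))
    joined.explain() // prints the physical plan, which contains one join per list element
    joined.show()

    spark.stop()
  }
}
```

Note that na.fill(0.0) replaces nulls in every numeric column of the result, including columns that came from the original single DataFrame.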


 