
Join multiple DataFrames in Scala

I have two variables. One is a DataFrame and the other is a List[DataFrame]. I wish to perform a join on these. At the moment I am using the following approach:

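import org.apache.spark.sql.DataFrame

def joinDfList(SingleDataFrame: DataFrame, DataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = {
    // Mutable accumulator, reassigned after each join
    var joinedDf = SingleDataFrame
    DataFrameList.foreach(
      Df => {
        joinedDf = joinedDf.join(Df, groupByCols, "left_outer")
      }
    )
    // Replace nulls in numeric columns (e.g. from unmatched rows) with 0.0
    joinedDf.na.fill(0.0)
}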

Is there an approach where we can avoid the "var" and use "foldLeft" instead of "foreach"?

You can simply write it without a var by using foldLeft:

def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = 
  dataFrameList.foldLeft(singleDataFrame)(
    (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
  ).na.fill(0.0)

In this code, dfAcc is the accumulator: it starts as singleDataFrame, at each step it is joined with the next DataFrame from dataFrameList, and in the end you get a single DataFrame.
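To see how the accumulator moves through the fold without needing a Spark session, here is a minimal sketch in plain Scala. The Table type, leftOuterJoin, joinAll, and the sample data are all hypothetical stand-ins invented for this illustration; they only mimic the shape of a left outer join and are not Spark's API.

```scala
object FoldLeftJoinSketch {
  // Hypothetical stand-in for a DataFrame: join key -> (column name -> value).
  type Table = Map[String, Map[String, Double]]

  // Left outer join: keep every key of `left`; add the columns of `right`
  // only when `right` has a row for the same key.
  def leftOuterJoin(left: Table, right: Table): Table =
    left.map { case (key, cols) =>
      key -> (cols ++ right.getOrElse(key, Map.empty[String, Double]))
    }

  // Same shape as the foldLeft in the answer: the accumulator starts as
  // `single` and absorbs one table from `others` per step.
  def joinAll(single: Table, others: List[Table]): Table =
    others.foldLeft(single)((acc, next) => leftOuterJoin(acc, next))

  val base: Table = Map("a" -> Map("x" -> 1.0), "b" -> Map("x" -> 2.0))
  val t1: Table   = Map("a" -> Map("y" -> 10.0))
  val t2: Table   = Map("b" -> Map("z" -> 20.0), "c" -> Map("z" -> 30.0))

  // Key "c" exists only on the right side, so the left outer join drops it.
  val joined: Table = joinAll(base, List(t1, t2))
}
```

If the list is empty, foldLeft simply returns the initial value, so joinAll(base, Nil) gives back base unchanged. The foldLeft over DataFrames behaves the same way in that respect.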

Important: be careful, because chaining too many joins in one job can degrade performance.
