Join multiple dataframes in Scala
I have two variables. One is a DataFrame and the other is a List[DataFrame]. I wish to perform a join on these. At the moment I am using the following approach:
def joinDfList(SingleDataFrame: DataFrame, DataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = {
  var joinedDf = SingleDataFrame
  DataFrameList.foreach(
    Df => {
      joinedDf = joinedDf.join(Df, groupByCols, "left_outer")
    }
  )
  joinedDf.na.fill(0.0)
}
Is there an approach that avoids the "var" and uses "foldLeft" instead of "foreach"?
You can simply write it without vars using foldLeft:
def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame =
  dataFrameList.foldLeft(singleDataFrame)(
    (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
  ).na.fill(0.0)
In this code, dfAcc is the accumulator: it starts as singleDataFrame, and at each step it is joined with the next DataFrame from dataFrameList. At the end you get a single DataFrame.
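To make the accumulator behaviour concrete, here is a minimal plain-Scala sketch that needs no Spark. It is only an illustration: the object FoldJoinSketch and its join helper are hypothetical stand-ins that record the join order as a string instead of calling DataFrame.join.

```scala
object FoldJoinSketch {
  // Hypothetical stand-in for DataFrame.join: it only records the join order.
  def join(left: String, right: String): String = s"($left join $right)"

  // Style from the question: a mutable accumulator updated inside foreach.
  def withVar(first: String, rest: List[String]): String = {
    var acc = first
    rest.foreach { next => acc = join(acc, next) }
    acc
  }

  // Style from the answer: foldLeft threads the accumulator through the list.
  def withFold(first: String, rest: List[String]): String =
    rest.foldLeft(first)((acc, next) => join(acc, next))

  def main(args: Array[String]): Unit = {
    val rest = List("b", "c", "d")
    println(withFold("a", rest))                       // (((a join b) join c) join d)
    println(withVar("a", rest) == withFold("a", rest)) // true
  }
}
```

Both versions nest the joins to the left in list order, so they should be interchangeable. With an empty list, foldLeft simply returns the initial value, which in the Spark version means singleDataFrame comes back with only na.fill(0.0) applied.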
Important! Be careful: using too many joins in one job can degrade performance.