Spark Scala foldLeft performance slowness
Hi, I am trying to do an SCD Type 2 update on a DataFrame that has 280 columns.
val newYRecs = stgDF.columns.foldLeft(joinedDF) { (tempDF, colName) =>
  tempDF
    .withColumn("new_" + colName,
      when(col("stg." + colName).isNull, col("tgt." + colName))
        .otherwise(col("stg." + colName)))
    .drop(col("stg." + colName))
    .drop(col("tgt." + colName))
    .withColumnRenamed("new_" + colName, colName)
}
This alone takes 8 minutes to execute. Is there any way this can be optimized?
According to this article, it seems that withColumn has a hidden cost in the Catalyst optimizer that hampers performance when it is used on many columns: each call adds a new projection to the plan, so 280 calls produce a very deep plan that is expensive to analyze. I would try the proposed workaround and do something like this (also, while you're at it, you can make the code cleaner with coalesce):
val newYRecs = joinedDF.select(stgDF.columns.map { colName =>
  coalesce(col("stg." + colName), col("tgt." + colName)) as colName
}: _*)
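To illustrate the idea end to end, here is a minimal, self-contained sketch. The schema, column names, and join key (`id`) are my own assumptions for demonstration and are not from the original post; the point is that a single select produces one projection in the Catalyst plan, instead of one plan node per withColumn call.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

object ScdCoalesceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("scd-coalesce-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical staging and target tables (the real ones have 280 columns).
    val stgDF = Seq((1, "new-a"), (3, "new-c")).toDF("id", "value")
    val tgtDF = Seq((1, "old-a"), (2, "old-b")).toDF("id", "value")

    // Alias both sides so "stg.<col>" / "tgt.<col>" resolve after the join.
    val joinedDF = stgDF.as("stg")
      .join(tgtDF.as("tgt"), col("stg.id") === col("tgt.id"), "full_outer")

    // One projection over all columns: staging value wins, target fills the gaps.
    val newYRecs = joinedDF.select(stgDF.columns.map { colName =>
      coalesce(col("stg." + colName), col("tgt." + colName)) as colName
    }: _*)

    newYRecs.show()
    spark.stop()
  }
}
```

Note that coalesce takes the first non-null argument, which is exactly the `when(...isNull, ...).otherwise(...)` pattern from the question collapsed into one expression per column.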