Spark Scala foldLeft performance slowness

Hi, I am trying to do an SCD Type 2 update on a DataFrame that has 280 columns.

val newYRecs = stgDF.columns
  .foldLeft(joinedDF) { (tempDF, colName) =>
    tempDF
      .withColumn("new_" + colName, when(col("stg." + colName).isNull, col("tgt." + colName)).otherwise(col("stg." + colName)))
      .drop(col("stg." + colName))
      .drop(col("tgt." + colName))
      .withColumnRenamed("new_" + colName, colName)
  }

This step alone takes 8 minutes to execute. Is there any way it can be optimized?

According to this article, withColumn appears to carry a hidden cost in the Catalyst optimizer that hurts performance when it is applied to many columns. I would try the proposed workaround and do something like this (while you're at it, you can also make your code cleaner with coalesce):

val newYRecs = joinedDF.select(stgDF.columns.map{ colName =>
      coalesce(col("stg." + colName), col("tgt."+ colName)) as colName
}: _*)
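
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. Everything around the select is an assumption on my part: the toy three-column data, the local SparkSession, the full outer join on a hypothetical id key, and the helper name mergePreferStaging; the question does not show how joinedDF was actually built. The point it illustrates is that a single select builds one projection for all columns, whereas each withColumn call in the foldLeft adds another projection to the plan that then has to be analyzed again.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object CoalesceSelectSketch {
  // One projection for all columns: take the stg value, fall back to tgt when it is null.
  // Assumes joinedDF was built from DataFrames aliased "stg" and "tgt".
  def mergePreferStaging(joinedDF: DataFrame, columns: Seq[String]): DataFrame =
    joinedDF.select(columns.map(c => coalesce(col(s"stg.$c"), col(s"tgt.$c")).as(c)): _*)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("coalesce-select-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the staging and target tables (hypothetical data).
    val stgDF = Seq((1, Option("new-name"), Option.empty[String])).toDF("id", "name", "city")
    val tgtDF = Seq((1, Option("old-name"), Option("Paris"))).toDF("id", "name", "city")

    // Assumed join: full outer on a hypothetical "id" key.
    val joinedDF = stgDF.as("stg")
      .join(tgtDF.as("tgt"), col("stg.id") === col("tgt.id"), "full_outer")

    val newYRecs = mergePreferStaging(joinedDF, stgDF.columns.toSeq)
    newYRecs.show()

    spark.stop()
  }
}
```

With the toy rows above, the name should come from stg and the city from tgt, since the stg city is null. Comparing newYRecs.explain(true) for both versions on the real 280-column data is probably the quickest way to check whether plan analysis is really where the 8 minutes go.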
