
Spark dataFrame taking too long to display after updating its columns

I have a dataFrame of approx. 4 million rows and 35 columns as input.

All I do to this dataFrame is the following steps:

  • For a list of given columns, I calculate a sum over a given list of group features and join it as a new column to my input dataFrame
  • I drop each new sum column right after joining it to the dataFrame.

Therefore we end up with the same dataFrame as we started from (in theory).

However, I noticed that if my list of given columns gets too big (more than 6 columns), the output dataFrame becomes impossible to manipulate. Even a simple display takes 10 minutes.

Here is an example of my code (df is my input dataFrame):

  from pyspark.sql.functions import sum  # Spark's sum, not the Python builtin

  for c in list_columns:
    df = df.join(df.groupby(list_group_features).agg(sum(c).alias('sum_' + c)), list_group_features)
    df = df.drop('sum_' + c)

This happens due to the inner workings of Spark and its lazy evaluation.

What Spark does when you call groupby, join or agg is attach these calls to the plan of the df object. So even though it is not executing anything on the data, you are building a large execution plan which is stored internally in the Spark DataFrame object.

Only when you call an action (show, count, write, etc.) does Spark optimize the plan and execute it. If the plan is too large, the optimization step can take a while. Also remember that plan optimization happens on the driver, not on the executors. So if your driver is busy or overloaded, it delays the Spark plan optimization step as well.
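To make this concrete, here is a minimal sketch (assuming the df, list_columns and list_group_features from the question) that reproduces the loop and then prints the plan Spark has accumulated; no data is processed until an action is called:

import pyspark.sql.functions as F

# Each join/agg in the loop only extends the logical plan; nothing runs yet.
for c in list_columns:
    df = df.join(
        df.groupby(list_group_features).agg(F.sum(c).alias('sum_' + c)),
        list_group_features,
    )
    df = df.drop('sum_' + c)

# explain(True) prints the parsed, analyzed, optimized and physical plans.
# After many iterations the plan becomes very large, which is what makes
# the later display/show so slow.
df.explain(True)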

It is useful to remember that joins are expensive operations in Spark, both to optimize and to execute. If you can, you should avoid joins when operating on a single DataFrame and use the window functionality instead. Joins should only be used when you are combining different DataFrames from different sources (different tables).

A way to optimize your code would be:

import pyspark.sql.functions as f
from pyspark.sql import Window

# One window per group key; each sum is computed over this window,
# so no join is needed.
w = Window.partitionBy(list_group_features)
agg_sum_exprs = [f.sum(f.col(c)).over(w).alias("sum_" + c) for c in list_columns]
res_df = df.select(df.columns + agg_sum_exprs)

This should be scalable and fast for large list_group_features and list_columns lists.
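If, as in the question, the sum columns are only needed temporarily, they can all be dropped afterwards in a single step; a minimal sketch, assuming the res_df and list_columns from above:

# Drop every temporary sum column in one go; this adds a single projection
# to the plan instead of one join + drop per column.
final_df = res_df.drop(*["sum_" + c for c in list_columns])

# Triggering an action now should be much cheaper than with the join-based loop.
final_df.show()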
