How do I identify the value of a skewed task of my Foundry job?

I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?

My Python Transforms code looks like this:

from transforms.api import Input, Output, transform


@transform(
  ...
)
def my_compute_function(...):
  ...
  df = df.join(df_2, ["joint_col"])
  ...

Theory

Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include, but are not limited to: joins, windows, and groupBys.

These operations move data across your Executors based on the values found in the DataFrames involved. This means that when a DataFrame has many repeated values in the column dictating the exchange, those rows all end up in the same task, increasing its size.
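You can see this exchange directly in a job's physical plan. The following is a minimal, hypothetical PySpark sketch (plain Spark outside Foundry; the data and column names are made up) in which explain() shows the Exchange hashpartitioning step that shuffles rows with the same join key into the same task:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Disable broadcast joins so the plan shows a shuffle-based (sort-merge) join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

df = spark.createDataFrame([("key_1", 1), ("key_2", 2)], ["joint_col", "val_a"])
df_2 = spark.createDataFrame([("key_1", 10), ("key_3", 30)], ["joint_col", "val_b"])

# The printed plan contains "Exchange hashpartitioning(joint_col, ...)" on both inputs:
# every row with a given joint_col value is routed to the same task.
df.join(df_2, ["joint_col"]).explain()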

Example

Let's consider the following example distribution of data for your join:

DataFrame 1 (df1)

| col_1 | col_2 |
|-------|-------|
| key_1 | 1     |
| key_1 | 2     |
| key_1 | 3     |
| key_1 | 1     |
| key_1 | 2     |
| key_2 | 1     |

DataFrame 2 (df2)

| col_1 | col_2 |
|-------|-------|
| key_1 | 1     |
| key_1 | 2     |
| key_1 | 3     |
| key_1 | 1     |
| key_2 | 2     |
| key_3 | 1     |

These DataFrames, when joined together on col_1, will have the following data distributed across the executors:

  • Task 1:
    • Receives: 5 rows of key_1 from df1
    • Receives: 4 rows of key_1 from df2
    • Total Input: 9 rows of data sent to task_1
    • Result: 5 * 4 = 20 rows of output data
  • Task 2:
    • Receives: 1 row of key_2 from df1
    • Receives: 1 row of key_2 from df2
    • Total Input: 2 rows of data sent to task_2
    • Result: 1 * 1 = 1 row of output data
  • Task 3:
    • Receives: 1 row of key_3 from df2
    • Total Input: 1 row of data sent to task_3
    • Result: 1 * 0 = 0 rows of output data (missed key; no matching key in df1)

If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
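To make the arithmetic above concrete, here is a small illustrative PySpark snippet (not part of the original transform) that rebuilds the two example DataFrames and counts the joined rows per key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("key_1", 1), ("key_1", 2), ("key_1", 3), ("key_1", 1), ("key_1", 2), ("key_2", 1)],
    ["col_1", "col_2"],
)
df2 = spark.createDataFrame(
    [("key_1", 1), ("key_1", 2), ("key_1", 3), ("key_1", 1), ("key_2", 2), ("key_3", 1)],
    ["col_1", "col_2"],
)

# Inner join on col_1, then count the output rows per key:
# key_1 -> 5 * 4 = 20 rows, key_2 -> 1 * 1 = 1 row, key_3 -> dropped (no match in df1).
df1.join(df2, ["col_1"]).groupBy("col_1").count().show()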

Identification

The question now becomes how we identify that key_1 is the culprit of the skew, since this isn't visible in Spark (the underlying engine powering the join).

If we look at the above example, we see that all we need to know is the actual count per key of the joint column. This means we can:

  1. Aggregate each side of the join on the joint key and count the rows per key
  2. Multiply the counts from each side of the join to determine the output row counts (a PySpark sketch of these two steps follows below)
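As a rough sketch, assuming the two inputs to your join and a joint column named joint_col as in the question's code, these two steps could look like this in PySpark (an illustrative helper, not a Foundry-specific API):

from pyspark.sql import DataFrame, functions as F


def skew_report(left: DataFrame, right: DataFrame, key: str = "joint_col") -> DataFrame:
    """Count rows per join key on each side and estimate the output rows per key."""
    left_counts = left.groupBy(key).agg(F.count("*").alias("left_count"))
    right_counts = right.groupBy(key).agg(F.count("*").alias("right_count"))
    return (
        left_counts
        .join(right_counts, [key], "full")  # full join keeps keys that miss on either side
        .withColumn("output_row_count", F.col("left_count") * F.col("right_count"))
        .sort(F.desc("output_row_count"))   # largest (most skewed) keys first
    )


# Usage, e.g. while previewing the transform: skew_report(df, df_2, "joint_col").show(20)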

The easiest way to do this, though, is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:

  1. Add df1 as input to a first path

  2. Add a Pivot Table board, using col_1 as the rows and Row count as the aggregate


  3. Click the ⇄ Switch to pivoted data button


  4. Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df1_, so that the path's output is only df1_col_1 and df1_COUNT

  5. Add df2 as input to a second path

  6. Add a Pivot Table board, again using col_1 as the rows and Row count as the aggregate


  7. Click the ⇄ Switch to pivoted data button


  8. Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df2_, so that the path's output is only df2_col_1 and df2_COUNT

  9. Create a third path, using the result of the first path (df1_col_1 and df1_COUNT)

  10. Add a Join board, making the right side of the join the result of the second path (df2_col_1 and df2_COUNT). Ensure the join type is Full join


  11. Add all columns from the right side (you don't need to add a prefix; all the columns are unique)

  12. Configure the Join board to join on df1_col_1 equals df2_col_1


  13. Add an Expression board to create a new column, output_row_count, which multiplies the two COUNT columns together


  14. Add a Sort board that sorts on output_row_count descending


  15. If you now preview the resultant data, you will have a sorted list of the keys from both sides of the join that are causing the skew

