How do I identify the value of a skewed task of my Foundry job?
I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?

My Python Transforms code looks like this:
```python
from transforms.api import Input, Output, transform

@transform(
    ...
)
def my_compute_function(...):
    ...
    df = df.join(df_2, ["joint_col"])
    ...
```
Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include, but are not limited to: `join`s, `window`s, and `groupBy`s.
These operations result in data movement across your executors based upon the values found inside the DataFrames used. This means that when one of the DataFrames has many repeated values in the column dictating the exchange, those rows all end up in the same task, thus increasing its size.
Let's consider the following example distribution of data for your join:
DataFrame 1 (df1)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_1 | 2 |
| key_2 | 1 |
DataFrame 2 (df2)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_2 | 2 |
| key_3 | 1 |
When these DataFrames are joined together on `col_1`, the data is distributed across the executors as follows:

- Task 1: `key_1` rows from df1 and `key_1` rows from df2
- Task 2: `key_2` rows from df1 and `key_2` rows from df2
- Task 3: `key_3` rows from df2
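To make the arithmetic concrete, here is a minimal plain-Python sketch (not Spark) of how the per-key counts from the two tables above multiply into output rows:

```python
from collections import Counter

# Join-key values taken from the example tables above
df1_keys = ["key_1"] * 5 + ["key_2"]
df2_keys = ["key_1"] * 4 + ["key_2", "key_3"]

c1, c2 = Counter(df1_keys), Counter(df2_keys)

# An equi-join emits count1 * count2 rows per key, all landing in one task
output_rows = {key: c1[key] * c2[key] for key in c1.keys() | c2.keys()}
print(output_rows)  # key_1 -> 5 * 4 = 20 rows, vs 1 for key_2 and 0 for key_3
```

The 20 rows produced for `key_1` all flow through the same task, which is exactly the imbalance the tasks view surfaces.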
If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
The question now becomes how we identify that `key_1` is the culprit of the skew, since this isn't visible in Spark (the underlying engine powering the join).
If we look at the above example, we see that all we need to know is the actual count of rows per key of the join column. This means we can compute these counts ourselves.
The easiest way to do this is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:
1. Add `df1` as input to a first path.
2. Add a `Pivot Table` board, using `col_1` as the rows and `Row count` as the aggregate.
3. Click the `⇄ Switch to pivoted data` button.
4. Use the `Multi-Column Editor` board to keep only `col_1` and the `COUNT` column. Prefix each of them with `df1_`, so the output of the path is only `df1_col_1` and `df1_COUNT`.
5. Add `df2` as input to a second path.
6. Add a `Pivot Table` board, again using `col_1` as the rows and `Row count` as the aggregate.
7. Click the `⇄ Switch to pivoted data` button.
8. Use the `Multi-Column Editor` board to keep only `col_1` and the `COUNT` column. Prefix each of them with `df2_`, so the output of the path is only `df2_col_1` and `df2_COUNT`.
9. Create a third path, using the result of the first path (`df1_col_1` and `df1_COUNT`).
10. Add a `Join` board, making the right side of the join the result of the second path (`df2_col_1` and `df2_COUNT`). Ensure the join type is `Full join`. Add all columns from the right side (you don't need a prefix; all the column names are unique). Configure the join board to join on `df1_col_1` equals `df2_col_1`.
11. Add an `Expression` board to create a new column, `output_row_count`, which multiplies the two `COUNT` columns together.
12. Add a `Sort` board that sorts on `output_row_count` descending.
If you now preview the resultant data, you will have a sorted list of the keys, from both sides of the join, that are causing the skew.