How do I identify the value of a skewed task of my Foundry job?

I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?

My Python Transforms code looks like this:

from transforms.api import Input, Output, transform


@transform(
  ...
)
def my_compute_function(...):
  ...
  df = df.join(df_2, ["joint_col"])
  ...

Theory

Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include, but are not limited to: joins, windows, and groupBys.

These operations move data across your Executors based on the values found in the DataFrames involved. This means that when a DataFrame has many repeated values in the column dictating the exchange, those rows all end up in the same task, increasing its size.
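You can see this exchange directly in a job's physical plan. The following is a minimal, hypothetical PySpark sketch (plain Spark outside Foundry; the data and column names are made up) in which explain() shows the Exchange hashpartitioning step that shuffles rows with the same join key into the same task:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Disable broadcast joins so the plan shows a shuffle-based (sort-merge) join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

df = spark.createDataFrame([("key_1", 1), ("key_2", 2)], ["joint_col", "val_a"])
df_2 = spark.createDataFrame([("key_1", 10), ("key_3", 30)], ["joint_col", "val_b"])

# The printed plan contains "Exchange hashpartitioning(joint_col, ...)" on both inputs:
# every row with a given joint_col value is routed to the same task.
df.join(df_2, ["joint_col"]).explain()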

Example

Let's consider the following example distribution of data for your join:

DataFrame 1 (df1)

| col_1 | col_2 |
|-------|-------|
| key_1 | 1     |
| key_1 | 2     |
| key_1 | 3     |
| key_1 | 1     |
| key_1 | 2     |
| key_2 | 1     |

DataFrame 2 (df2)

| col_1 | col_2 |
|-------|-------|
| key_1 | 1     |
| key_1 | 2     |
| key_1 | 3     |
| key_1 | 1     |
| key_2 | 2     |
| key_3 | 1     |

These DataFrames, when joined together on col_1, will have the following data distributed across the executors:

  • Task 1:
    • Receives: 5 rows of key_1 from df1
    • Receives: 4 rows of key_1 from df2
    • Total Input: 9 rows of data sent to task_1
    • Result: 5 * 4 = 20 rows of output data
  • Task 2:
    • Receives: 1 row of key_2 from df1
    • Receives: 1 row of key_2 from df2
    • Total Input: 2 rows of data sent to task_2
    • Result: 1 * 1 = 1 row of output data
  • Task 3:
    • Receives: 1 row of key_3 from df2
    • Total Input: 1 row of data sent to task_3
    • Result: 1 * 0 = 0 rows of output data (missed key; no matching key in df1)

If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
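To make the arithmetic above concrete, here is a small illustrative PySpark snippet (not part of the original transform) that rebuilds the two example DataFrames and counts the joined rows per key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("key_1", 1), ("key_1", 2), ("key_1", 3), ("key_1", 1), ("key_1", 2), ("key_2", 1)],
    ["col_1", "col_2"],
)
df2 = spark.createDataFrame(
    [("key_1", 1), ("key_1", 2), ("key_1", 3), ("key_1", 1), ("key_2", 2), ("key_3", 1)],
    ["col_1", "col_2"],
)

# Inner join on col_1, then count the output rows per key:
# key_1 -> 5 * 4 = 20 rows, key_2 -> 1 * 1 = 1 row, key_3 -> dropped (no match in df1).
df1.join(df2, ["col_1"]).groupBy("col_1").count().show()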

Identification

The question now becomes how we identify that key_1 is the culprit of the skew, since this isn't visible in Spark (the underlying engine powering the join).

If we look at the above example, we see that all we need to know is the actual count per key of the joint column. This means we can:

  1. Aggregate each side of the join on the joint key and count the rows per key
  2. Multiply the counts from each side of the join to determine the output row counts (a PySpark sketch of these two steps follows below)
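As a rough sketch, assuming the two inputs to your join and a joint column named joint_col as in the question's code, these two steps could look like this in PySpark (an illustrative helper, not a Foundry-specific API):

from pyspark.sql import DataFrame, functions as F


def skew_report(left: DataFrame, right: DataFrame, key: str = "joint_col") -> DataFrame:
    """Count rows per join key on each side and estimate the output rows per key."""
    left_counts = left.groupBy(key).agg(F.count("*").alias("left_count"))
    right_counts = right.groupBy(key).agg(F.count("*").alias("right_count"))
    return (
        left_counts
        .join(right_counts, [key], "full")  # full join keeps keys that miss on either side
        .withColumn("output_row_count", F.col("left_count") * F.col("right_count"))
        .sort(F.desc("output_row_count"))   # largest (most skewed) keys first
    )


# Usage, e.g. while previewing the transform: skew_report(df, df_2, "joint_col").show(20)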

The easiest way to do this, though, is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:

  1. Add df1 as input to a first path

  2. Add a Pivot Table board, using col_1 as the rows and Row count as the aggregate


  3. Click the ⇄ Switch to pivoted data button


  4. Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df1_, so that the path's output is only df1_col_1 and df1_COUNT

  5. Add df2 as input to a second path

  6. Add a Pivot Table board, again using col_1 as the rows and Row count as the aggregate


  7. Click the ⇄ Switch to pivoted data button


  8. Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df2_, so that the path's output is only df2_col_1 and df2_COUNT

  9. Create a third path, using the result of the first path (df1_col_1 and df1_COUNT)

  10. Add a Join board, making the right side of the join the result of the second path (df2_col_1 and df2_COUNT). Ensure the join type is Full join


  11. Add all columns from the right side (you don't need to add a prefix; all the columns are unique)

  12. Configure the Join board to join on df1_col_1 equals df2_col_1


  13. Add an Expression board to create a new column, output_row_count, which multiplies the two COUNT columns together


  14. Add a Sort board that sorts on output_row_count descending


  15. If you now preview the resultant data, you will have a sorted list of the keys from both sides of the join that are causing the skew

