
How do I compute my Foundry 'latest version' dataset faster?

I have a dataset ingesting the latest edits to rows of my data, but it only ingests the recently edited version (i.e. it's incremental on an update_ts timestamp column).

Original table:

| primary_key | update_ts |
|-------------|-----------|
| key_1       | 0         |
| key_2       | 0         |
| key_3       | 0         |

Table as it gets updated:

| primary_key | update_ts |
|-------------|-----------|
| key_1       | 0         |
| key_2       | 0         |
| key_3       | 0         |
| key_1       | 1         |
| key_2       | 1         |
| key_1       | 2         |

After ingestion, I need to compute the 'latest version' for all prior updates while also taking into account any new edits (for the table above, that would be key_1 at update_ts 2, key_2 at 1, and key_3 at 0).

This means I am taking the incremental ingest and running a SNAPSHOT output each time. This is very slow for my build, since I have to look over all of my output rows every time I want to compute the latest version of my data (a rough sketch of this recompute follows the transaction tables below).

Transaction n=1 (SNAPSHOT):

| primary_key | update_ts |
|-------------|-----------|
| key_1       | 0         |
| key_2       | 0         |
| key_3       | 0         |

Transaction n=2 (APPEND):

| primary_key | update_ts |
|-------------|-----------|
| key_1       | 1         |
| key_2       | 1         |
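
For illustration, the slow recompute described above looks roughly like the following (a sketch only; the window-based ranking and the all_rows name are assumptions, column names are taken from the tables above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank every row of the FULL history by update_ts within its primary_key and
# keep only the newest one; this shuffles the entire output on every build.
w = Window.partitionBy("primary_key").orderBy(F.col("update_ts").desc())
latest = (
    all_rows                      # complete history: prior output plus new appends
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)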

How can I make this 'latest version' computation faster?

This is a common pattern that will benefit from bucketing.

The gist of this is: write your output SNAPSHOT into buckets based on your primary_key column, so that the expensive step of shuffling your much larger output is skipped entirely.

This means you will only have to exchange your new data into the buckets that already contain your prior history.
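
For readers less familiar with bucketing, the same idea in plain Spark (outside Foundry; df, the bucket count and the table name here are placeholders) looks like this:

# Rows sharing a primary_key always land in the same bucket file, so a later
# join on that key can skip the shuffle of this (large) side entirely.
df.write \
    .bucketBy(600, "primary_key") \
    .sortBy("update_ts") \
    .mode("overwrite") \
    .saveAsTable("clean_dataset_bucketed")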

Let's start from the initial state, where we are running on a previously computed 'latest' version that was a slow SNAPSHOT:

- output: raw_dataset
  input: external_jdbc_system
  hive_partitioning: none
  bucketing: none
  transactions:
    - SNAPSHOT
    - APPEND
    - APPEND
- output: clean_dataset
  input: raw_dataset
  hive_partitioning: none
  bucketing: none
  transactions:
    - SNAPSHOT
    - SNAPSHOT
    - SNAPSHOT

If we write out clean_dataset bucketed over the primary_key column, into a number of buckets calculated separately to fit the data scale we anticipate, we would need the following code:

from transforms.api import transform, incremental, Input, Output
import pyspark.sql.functions as F


# NOTE: the incremental decorator is assumed here; it is what makes the
# "added"/"current" read modes and the "replace" write mode below available.
@incremental()
@transform(
    my_output=Output("/datasets/clean_dataset"),
    my_input=Input("/datasets/raw_dataset")
)
def my_compute_function(my_input, my_output):

    BUCKET_COUNT = 600
    PRIMARY_KEY = "primary_key"
    ORDER_COL = "update_ts"

    # Read only the newly appended rows, plus the previously written output.
    updated_keys = my_input.dataframe("added")
    last_written = my_output.dataframe("current")

    # Distribute the (small) delta the same way the output is bucketed, so the
    # join below only has to shuffle the new rows.
    updated_keys = updated_keys.repartition(BUCKET_COUNT, PRIMARY_KEY)

    # value_cols: every non-key column; each side is renamed below so the two
    # versions of a row can be told apart after the join.
    value_cols = [x for x in last_written.columns if x != PRIMARY_KEY]

    updated_keys = updated_keys.select(
      PRIMARY_KEY,
      *[F.col(x).alias("updated_keys_" + x) for x in value_cols]
    )

    last_written = last_written.select(
      PRIMARY_KEY,
      *[F.col(x).alias("last_written_" + x) for x in value_cols]
    )

    # Full outer join on the key: keys only in the delta are brand new, keys
    # only in the prior output are untouched, and keys present in both take
    # the newer values via the coalesce below.
    all_rows = updated_keys.join(last_written, PRIMARY_KEY, "fullouter")
    
    latest_df = all_rows.select(
      PRIMARY_KEY,
      *[F.coalesce(
          F.col("updated_keys_" + x),
          F.col("last_written_" + x)
        ).alias(x) for x in value_cols]
    )

    # Replace the entire output with a freshly bucketed SNAPSHOT so the next
    # incremental build can read it back without shuffling it.
    my_output.set_mode("replace")

    return my_output.write_dataframe(
        latest_df,
        bucket_cols=PRIMARY_KEY,
        bucket_count=BUCKET_COUNT,
        sort_by=ORDER_COL
    )

When this runs, you'll notice in your query plan that the project step over the output no longer includes an exchange, which means it won't be shuffling that data. The only exchange you'll now see is on the input, where it needs to distribute the changes in exactly the same way the output was formatted (this is a very fast operation).

This exchange is then preserved into the fullouter join step, where the join will exploit it and run the 600 tasks very quickly. Finally, we maintain the format on the output by explicitly bucketing into the same number of buckets, over the same columns, as before.
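
One way to confirm this is to print the physical plan from inside the transform just before writing (a minimal sketch; the exact plan text varies by Spark version):

    # Just before write_dataframe: the extended plan should show an Exchange
    # only above the small "added" input, not above the bucketed prior output.
    latest_df.explain(True)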

NOTE: with this approach, the file sizes in each bucket will grow over time, and this does not account for the need to increase the bucket count to keep files nicely sized. You will eventually hit a threshold with this technique where file sizes exceed 128MB and you are no longer executing efficiently (the fix is to bump the BUCKET_COUNT value).
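
As a rough sizing sketch (the target file size and the data-volume estimate below are assumptions, not values from this setup), BUCKET_COUNT can be derived from the output size you anticipate:

# Keep each bucket's file comfortably under the ~128MB threshold noted above.
TARGET_BYTES_PER_BUCKET = 100 * 1024 * 1024   # ~100MB, leaving headroom for growth
ESTIMATED_OUTPUT_BYTES = 60 * 1024 ** 3       # assumed ~60GB of 'latest version' data
BUCKET_COUNT = -(-ESTIMATED_OUTPUT_BYTES // TARGET_BYTES_PER_BUCKET)  # ceiling division, here 615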

Your output will now look like the following:

- output: raw_dataset
  input: external_jdbc_system
  hive_partitioning: none
  bucketing: none
  transactions:
    - SNAPSHOT
    - APPEND
    - APPEND
- output: clean_dataset
  input: raw_dataset
  hive_partitioning: none
  bucketing: BUCKET_COUNT by PRIMARY_KEY
  transactions:
    - SNAPSHOT
    - SNAPSHOT
    - SNAPSHOT
