
How do I get better performance in my Palantir Foundry transformation when my data scale is small?

I have datasets that are each under 1GB in size, and the total output size of my transformation is under 1GB. I've noticed that my workbook builds are pretty slow for the data scale I would expect, and I'm wondering what 'dials' I can turn to optimize these.

For example, I see in the Spark Details of a build that several of my stages have 200 tasks, and each of these tasks is only getting a couple of KB of data. Is that right?

At this scale of data, there are a couple of things you can adjust in your build to make it more appropriately optimized.

Ensure your Code Workbook / Code Repository is using AQE

It's worth verifying your Build is running using AQE, as noted over here. This will ensure your stages don't split up their work into 200 tasks (way too many for this scale; tasks sized in the KB range will suffer from too much network I/O).

The default task sizes are probably fine for your job, so don't modify the advisory partition sizes unless proven otherwise.
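One way to check is from inside the job itself; a minimal sketch, assuming a standard PySpark session is available (these are stock Spark 3.x settings, not Foundry-specific ones):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Both should come back "true" for AQE to coalesce small shuffle
    # partitions instead of fanning a tiny stage out into 200 tasks.
    print(spark.conf.get("spark.sql.adaptive.enabled"))
    print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))

    # Advisory target size per coalesced partition (64MB by default);
    # per the note above, leave it alone unless profiling proves otherwise.
    print(spark.conf.get("spark.sql.adaptive.advisoryPartitionSizeInBytes"))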

Consider using Local mode

Since your data scale is small enough, you might consider using what's called Spark "Local Mode". This is when you don't use any Executors to do your work and instead hold the entire contents of your job inside the Driver itself. This means you don't move data across the cluster to perform joins, windows, groupBys, etc., but instead keep it all in memory on the Driver's host. This only works so long as all your data can indeed fit into memory, but at small scales where this is true, it means your data is substantially faster to access and use.

In Code Repositories, you would apply KUBERNETES_NO_EXECUTORS to your transform; in Code Workbooks, you'll want to reach out to your Palantir support engineers to configure this behavior.
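As a rough illustration, a minimal sketch of what applying that profile looks like in a Code Repositories Python transform (the dataset paths below are hypothetical, and the profile may need to be enabled for import in your repository's settings):

    from transforms.api import configure, transform_df, Input, Output

    # KUBERNETES_NO_EXECUTORS runs the job in Spark local mode: no
    # executors are requested, and all tasks run on the driver's cores.
    @configure(profile=["KUBERNETES_NO_EXECUTORS"])
    @transform_df(
        Output("/Project/datasets/clean_output"),      # hypothetical path
        source=Input("/Project/datasets/raw_input"),   # hypothetical path
    )
    def compute(source):
        # Your actual transformation logic goes here; the profile on the
        # decorator is the part that matters for this example.
        return source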

What you'll then see is your transform having zero executors assigned to it, but still some tasks running in parallel. They will all just be running in parallel on your driver, using each core the Driver has. NOTE: be very careful not to boost the number of cores too high, otherwise you will increase your risk of OOM, per the guidance here. Essentially, the fractional share of memory per core actually decreases as you increase core counts, which will increase the risk of an individual task OOMing. For illustration, with a hypothetical 8GB driver, each of 2 concurrent tasks can draw on roughly 4GB, while each of 8 can only draw on roughly 1GB. You also don't want to subscribe too many cores to the Driver for better 'parallelism', because you likely should consider using the standard Executor-based compute setup if you go much beyond 4 parallel tasks.

Since you are now using only resources on your Driver, you may need to boost the number of cores to support the maximum number of tasks that run concurrently. In a typical setup, this is 4, so you would apply DRIVER_CORES_LARGE in Code Repositories, and similarly would reach out to your Palantir Support for configuration in a Code Workbook.
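Assuming the same hypothetical transform as in the sketch above, the two profiles combine in a single configure call:

    from transforms.api import configure, transform_df, Input, Output

    # Local mode plus a larger driver: 4 cores to match the typical
    # maximum of 4 parallel tasks noted above.
    @configure(profile=["KUBERNETES_NO_EXECUTORS", "DRIVER_CORES_LARGE"])
    @transform_df(
        Output("/Project/datasets/clean_output"),      # hypothetical path
        source=Input("/Project/datasets/raw_input"),   # hypothetical path
    )
    def compute(source):
        return source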


As an additional commentary, it's worth highlighting that Spark itself goes through a query planning process using the Catalyst engine, whereby optimizations are made so your job does the least amount of work possible when building the output. These optimizations take time to perform, which means you may observe more time being spent planning your query than actually executing it. At scales above ~1GB of input size, this is a feature; at the scale of this example, it means your performance is slightly worse than a simpler system's. If your data scale increases, however, this optimization step is crucial to maintaining scalability and performance.
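If you want to see what the planner actually produced for a given query, standard PySpark exposes it (df here is a hypothetical DataFrame from your transform):

    # Prints the parsed, analyzed, and optimized logical plans plus the
    # physical plan Catalyst settled on for this query.
    df.explain(mode="extended")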

What settings should we use in Foundry/Spark when we have large datasets generating high numbers of small output files that need to be sent over via Magritte? Use case scenario: in terms of data transfer, the bottleneck is the number of files generated in Foundry (50k+) more than the total size (6GB) – 2hrs each way.
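As a general Spark note rather than anything from the original answer: the number of output files tracks the number of partitions at write time, so a hedged first step is to repartition down before writing (the paths and partition count below are illustrative):

    from transforms.api import transform_df, Input, Output

    @transform_df(
        Output("/Project/datasets/export_ready"),       # hypothetical path
        source=Input("/Project/datasets/wide_fanout"),  # hypothetical path
    )
    def compute(source):
        # 48 partitions targets files of roughly 128MB each for ~6GB of
        # data, instead of 50k+ tiny files; tune the count to your data.
        return source.repartition(48)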
