
Performance Improvement in scala dataframe operations

I am using a table that is partitioned by the load_date column and optimized weekly with the Delta OPTIMIZE command as the source dataset for my use case.

The table schema is shown below:

+-----------------+--------------------+------------+---------+--------+---------------+
|               ID|          readout_id|readout_date|load_date|item_txt| item_value_txt|
+-----------------+--------------------+------------+---------+--------+---------------+

Later this table is pivoted on the item_txt and item_value_txt columns, and many operations are applied using multiple window functions, as shown below:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

val windowSpec = Window.partitionBy("id", "readout_date")
val windowSpec1 = Window.partitionBy("id", "readout_date").orderBy(col("readout_id").desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)

These window functions are used to apply several pieces of logic to the data. A few joins are also used to process the data.
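
For illustration, a minimal sketch of how specs like these are typically applied to the pivoted table; the derived column names and the numeric item_value column are assumptions, not the actual pipeline:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: windowSpec1/3/4 are the specs defined above; "item_value" is an assumed column.
def applyWindowLogic(pivoted: DataFrame): DataFrame =
  pivoted
    .withColumn("latest_readout_id", first(col("readout_id")).over(windowSpec1))       // newest readout per id/readout_date
    .withColumn("running_total", sum(col("item_value")).over(windowSpec3))             // cumulative sum up to the current row
    .withColumn("running_total_prev", sum(col("item_value")).over(windowSpec4))        // cumulative sum excluding the current row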

The final table is partitioned by readout_date and id, and the performance is very poor: it takes a long time even for 100 ids and 100 readout_dates.

If I do not partition the final table, I get the error below.

Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.

The expected count of ids in production is in the billions, and I expect far worse throttling and performance issues when processing the complete data.

The cluster configuration and utilization metrics are provided below (attached as screenshots).


Please let me know if anything is wrong with how I am repartitioning, and any methods to improve cluster utilization and performance.

Any leads appreciated!

spark.driver.maxResultSize is just a setting; you can increase it. But it is set at 4 GiB to warn you that you are doing something inefficient and should optimize your work. You are doing the right thing by asking for help to optimize.
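
For completeness, a minimal sketch of raising that limit when building the session (the 8g value is purely illustrative; on Databricks this would normally go in the cluster's Spark config). Raising it only hides the symptom, though; the real fix is to stop funnelling so much data through the driver.

import org.apache.spark.sql.SparkSession

// Illustrative only: raise the cap on serialized results returned to the driver.
val spark = SparkSession.builder()
  .appName("readout-pipeline")
  .config("spark.driver.maxResultSize", "8g")
  .getOrCreate()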

The first thing I suggest, if you care about performance, is to get rid of the windows. The first three windows you use could be replaced with a groupBy, and that will perform better. The last two windows are definitely harder to reframe as a group by, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. You might think that would perform worse, but I'm here to tell you that if you can avoid using a window you will get better performance almost every time. Windows aren't bad things; they are a tool to be used, but they do not perform well on unbounded data. (Can you do anything as an intermediate step to reduce the data the window needs to examine?) Or can you use aggregate functions to complete the work without having to use a window? You should explore your options.
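
As a concrete example, a spec like windowSpec above (partitioned by id and readout_date, no ordering) is usually there just to attach a per-group aggregate to every row; the same result can come from a groupBy plus a join back. This is only a sketch, and item_value is an assumed numeric column:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Window version, for comparison: df.withColumn("max_value", max(col("item_value")).over(windowSpec))
// groupBy + join version: aggregate once per (id, readout_date), then join the result back.
def attachGroupMax(df: DataFrame): DataFrame = {
  val perGroup = df
    .groupBy("id", "readout_date")
    .agg(max(col("item_value")).as("max_value"))
  df.join(perGroup, Seq("id", "readout_date"))
}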

Given your other answers, you should be grouping by id, not windowing by id, and likely using aggregates (sum) by week of year or month. That would likely give you really speedy performance with the loss of some granularity, and it would give you enough insight to decide whether to look into something deeper... or not.
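
A rough sketch of that idea, assuming a numeric item_value column (the aggregate and column names are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// One row per id per week instead of per readout: far less data, less granularity.
def weeklySummary(df: DataFrame): DataFrame =
  df.groupBy(
      col("id"),
      year(col("readout_date")).as("readout_year"),
      weekofyear(col("readout_date")).as("readout_week"))
    .agg(sum(col("item_value")).as("weekly_value"))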

If you want more accuracy, I'd suggest converting your nulls to 0s, and then using:

val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date").asc) // asc is important as it flips the relationship so that it groups the previous nulls

Then create a running total on the SIG_XX VAL, or whatever signal you want to look into, over that window. Call the new column 'null-partitions'.

This effectively allows you to group the numbers (by null-partitions), and you can then run aggregate functions with a group by to complete your calculations. Window and group by can do the same thing; a window is just more expensive in how it moves data, which slows things down. Group by uses more of the cluster to do the work and speeds up the process.
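
A minimal sketch of that pattern, assuming a nullable signal column named sig_val (the column name is a placeholder; 'null-partitions' becomes null_partitions to stay a valid identifier):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Nulls become 0, the running total over the asc-ordered window labels each segment
// (the 'null-partitions' column), and the heavy per-segment work is then a plain groupBy.
def segmentAndAggregate(df: DataFrame): DataFrame = {
  val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date").asc)
  df.withColumn("sig_val", coalesce(col("sig_val"), lit(0.0)))
    .withColumn("null_partitions", sum(col("sig_val")).over(windowSpec1))
    .groupBy("id", "null_partitions")
    .agg(sum(col("sig_val")).as("segment_total"))
}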
