
Improve performance of processing billions-of-rows data in Spark SQL

In my corporate project, I need to use Spark SQL to cross join a dataset of over a billion rows with another of about a million rows. Since a cross join is so expensive, I decided to split the first dataset into several parts (each with about 250 million rows), cross join each part with the million-row dataset, and then combine the partial results with UNION ALL.
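The split-and-union idea can be illustrated with a toy plain-Python sketch (not Spark code; the function names here are made up for illustration):

```python
from itertools import islice, product

def chunked(rows, size):
    """Yield successive chunks of `size` rows from an iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def chunked_cross_join(big, small, chunk_size):
    """Cross join `big` with `small` one chunk at a time, then
    concatenate the partial results -- the "union all" step."""
    small = list(small)
    result = []
    for chunk in chunked(big, chunk_size):
        result.extend(product(chunk, small))
    return result
```

In Spark the same pattern would filter the large DataFrame into slices, `crossJoin` each slice with the small one, and `union` the pieces; the output size is still |big| × |small| rows, so splitting only bounds the size of each intermediate task, not the total work.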

Now I need to improve the performance of these joins. I have heard this can be done by partitioning the data and distributing the work across Spark workers. My questions are: how can partitioning be used effectively here, and what other approaches are there that do not rely on partitioning?

Edit: filtering is already included.

Well, in all of these scenarios you will end up with a huge amount of data. Be careful, and avoid Cartesian joins on big datasets as much as possible, as they usually end in OOM exceptions.

Yes, partitioning can help, because you need to distribute your workload from one node to more nodes, or even to the whole cluster. The default partitioning mechanism is a hash of the key, or the original partitioning key from the source (Spark takes this from the source directly). First evaluate what your partitioning key is right now; then you may find a better partitioning key or mechanism and repartition the data, thereby distributing the load. The join still has to be done either way, but it will run against more parallel sources.
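The default hash-partitioning idea can be sketched in plain Python (a toy model of what Spark's HashPartitioner does, not Spark code):

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its join key.
    Rows with equal keys always land in the same partition, so a
    join on that key can proceed partition-by-partition in parallel."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key(row)) % num_partitions].append(row)
    return partitions
```

This co-location property is why choosing the right partitioning key matters: if both sides of a join are partitioned on the join key, each worker only needs its own pair of partitions and no further shuffle is required.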

There should be some filters on your join query. You can use the filter attributes as the key to partition the data, and then join based on those partitions.
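Combining the two suggestions, filter first and then join bucket-by-bucket on the partition key. A toy plain-Python hash join sketching this (illustrative names, not a Spark API):

```python
from collections import defaultdict

def filtered_partitioned_join(left, right, key, predicate):
    """Apply the filter before the join, then bucket the smaller
    side by the join key and probe it row by row -- the same
    shape as a partitioned hash join."""
    buckets = defaultdict(list)
    for row in right:
        if predicate(row):                     # filter early
            buckets[key(row)].append(row)
    out = []
    for row in left:
        if predicate(row):                     # filter early
            out.extend((row, match) for match in buckets[key(row)])
    return out
```

Pushing the filter below the join shrinks both inputs before any pairing happens, which is usually a far bigger win than tuning the join itself; Spark's optimizer does this automatically when the filter is expressed in the query rather than applied to the joined result.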

