
Repartitioning Large Files in Spark

I am very new to Spark and have a 1 TB file to process.

My system specifications are:

Each node: 64 GB RAM
Number of nodes: 2
Cores per node: 5

As I understand it, I have to repartition the data for better parallelism, since Spark will by default only create partitions based on the total number of cores (times 2, 3, or 4). But in my case, since the data file is very large, I have to repartition it into a number of partitions such that the data can be processed efficiently.

How should I choose the number of partitions to pass to repartition? How should I calculate it? What approach should I take to solve this?

Thanks a lot in advance.

Partitions and parallelism are two different things, per my understanding. However, both go hand in hand when it comes to parallel execution of tasks in Spark. Parallelism is number of executors * number of cores, which in your case is 2 * 5 = 10. So at any given moment you could have at most 10 tasks running. If your data is divided into 10 partitions, then all of it would be processed at once. However, if you have 20 partitions, then Spark would start processing 10 partitions and, based on when each task finishes, schedule the next partitions to process. This continues until all the partitions have been processed.
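As a minimal sketch of that arithmetic (assuming a YARN or standalone cluster where spark.executor.instances and spark.executor.cores are honored; the app name is just an example):

```scala
import org.apache.spark.sql.SparkSession

// 2 executors with 5 cores each => 2 * 5 = 10 task slots,
// i.e. at most 10 tasks running concurrently on this cluster.
val spark = SparkSession.builder()
  .appName("parallelism-check")
  .config("spark.executor.instances", "2")
  .config("spark.executor.cores", "5")
  .getOrCreate()

// Spark's own notion of default parallelism (typically the total cores available).
println(s"defaultParallelism = ${spark.sparkContext.defaultParallelism}")
```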

By default one partition is one block of data. I am guessing your 1 TB of data is stored on HDFS. If the underlying block size is 256 MB, then you would have 1 TB / 256 MB = 4096 blocks, which in turn become partitions.
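Continuing the sketch above, you can check how many partitions Spark actually created when reading the file (the HDFS path below is a made-up example, not a real location):

```scala
// Read the 1 TB file and inspect the partition count derived from HDFS blocks.
// With a 256 MB block size, expect roughly 1 TB / 256 MB = 4096 partitions.
val df = spark.read.text("hdfs:///data/big_input.txt")
println(s"partitions on read = ${df.rdd.getNumPartitions}")
```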

Please note that once the data is read, you can always repartition it based on your requirements.
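For example, a sketch of repartitioning after the read (the partition counts here are purely illustrative, not recommendations):

```scala
// repartition does a full shuffle and can increase or decrease the count;
// coalesce only decreases it and avoids a full shuffle.
val widened  = df.repartition(8192)
val narrowed = df.coalesce(2048)
println(widened.rdd.getNumPartitions)   // => 8192
```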

How should I choose the number of partitions to pass to repartition? How should I calculate it? What approach should I take to solve this?

You need to see how your Spark application holds up with a given partition size and then determine whether you can decrease or increase that number. Executor memory is a consideration as well: if your partitions are too big, you can run into OutOfMemory errors. These are just guidelines, not an exhaustive list.
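As a rough sizing sketch: pick a partition count so each partition holds an amount of data your executors can comfortably handle. The ~256 MB-per-partition target below is an assumption (a common rule of thumb), not something Spark prescribes.

```scala
val totalInputBytes  = 1024L * 1024 * 1024 * 1024                 // ~1 TB of input
val targetPartBytes  = 256L * 1024 * 1024                          // aim for ~256 MB per partition
val targetPartitions = (totalInputBytes / targetPartBytes).toInt   // => 4096
val resized = df.repartition(targetPartitions)
println(s"repartitioned to ${resized.rdd.getNumPartitions} partitions")
```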

This multipart series, https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/, has a more detailed discussion of partitions and executors.
