
Spark: understanding partitioning - cores

I'd like to understand partitioning in Spark. I am running Spark in local mode on Windows 10. My laptop has 2 physical cores and 4 logical cores.

1/ Terminology: to me, a core in Spark = a thread. So a core in Spark is different from a physical core, right? A Spark core is associated with a task, right? If so, since you need a thread per partition, if my Spark SQL DataFrame has 4 partitions, it needs 4 threads, right?

2/ If I have 4 logical cores, does it mean that I can only run 4 concurrent threads at the same time on my laptop? So 4 in Spark?

3/ Setting the number of partitions: how do I choose the number of partitions of my DataFrame, so that further transformations and actions run as fast as possible?
- Should it have 4 partitions since my laptop has 4 logical cores?
- Is the number of partitions related to physical cores or logical cores?
- In the Spark documentation, it's written that you need 2-3 tasks per CPU. Since I have two physical cores, should the number of partitions be 4 or 6?

(I know that the number of partitions will not have much effect in local mode, but this is just to understand.)
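For reference, here is a minimal sketch of the kind of setup I mean (PySpark; the 4 in local[4] and the toy DataFrame are just placeholders):

```python
from pyspark.sql import SparkSession

# local[4] = run Spark in a single JVM with up to 4 worker threads,
# i.e. up to 4 tasks executing concurrently.
spark = (SparkSession.builder
         .master("local[4]")
         .appName("partitioning-demo")
         .getOrCreate())

df = spark.range(0, 1000000)   # simple DataFrame to experiment with

# Each partition is processed by exactly one task (one thread at a time).
print(df.rdd.getNumPartitions())

# Default partition count used after shuffles (joins, groupBy, ...).
print(spark.conf.get("spark.sql.shuffle.partitions"))
```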

  1. There's no such thing as a "Spark core". If you are referring to options like --executor-cores, then yes, that refers to how many tasks each executor will run concurrently.

  2. You can set the number of concurrent tasks to whatever you want, but going above the number of logical cores you have probably won't give any advantage.

  3. The number of partitions to use is situational. Without knowing the data or the transformations you are doing, it's hard to give a number. Typical advice is to use a number just below a multiple of your total cores; for example, if you have 16 cores, then 47, 79, 127 and similar numbers just under a multiple of 16 are good choices (see the sketch after this list). The reason for this is that you want to make sure all cores are working (you want as little time as possible where resources sit idle, waiting for others to finish), but you leave a little extra headroom to allow for speculative execution (Spark may decide to run the same task twice if it is running slowly, to see if it goes faster on a second try).
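As a rough sketch of what that looks like in practice (PySpark; the core count and partition numbers are purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partition-count").getOrCreate()
df = spark.range(0, 1000000)            # stand-in for your real DataFrame

# Illustrative only: suppose the machine/cluster has 16 total cores.
total_cores = 16
num_partitions = 3 * total_cores - 1    # 47: ~3 tasks per core, just under a multiple

# Repartition explicitly before an expensive stage...
df = df.repartition(num_partitions)

# ...and/or control the partition count produced by DataFrame shuffles
# (joins, aggregations) globally:
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```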

Picking the number is a bit of trial and error, though. Take advantage of the Spark job server to monitor how your tasks are running. Having few tasks, each with many records, means you should probably increase the number of partitions; on the other hand, many partitions with only a few records each is also bad, and in those cases you should try to reduce the partitioning. One rough way to check is sketched below.
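A quick way to eyeball how records are spread across partitions, and the two knobs for adjusting it (again PySpark, with made-up numbers):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partition-sizes").getOrCreate()
df = spark.range(0, 1000000)   # stand-in for your real DataFrame

# Records per partition (fine for toy data; glom() materialises whole partitions,
# so avoid it on large datasets -- use the Spark UI there instead).
print(df.rdd.glom().map(len).collect())

# Many partitions with only a few records each: shrink without a full shuffle.
df = df.coalesce(8)

# A few partitions with many records each: spread the work out (full shuffle).
df = df.repartition(32)
```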
