
What is the difference between SPARK Partitions and Worker Cores?

I used a standalone Spark cluster to process several files. When I executed the driver, the data was processed on each worker using its cores.

Now, I've read about partitions, but I don't understand whether they are different from worker cores or not.

Is there a difference between setting the number of cores and the number of partitions?

Simplistic view: Partitions vs. Number of Cores

When you invoke an action on an RDD,

  • A "Job" is created for it.为它创建了一个“作业”。 So, Job is a work submitted to spark.所以,Job 是一个提交给 spark 的工作。
  • Jobs are divided in to "STAGE" based n the shuffle boundary!!!工作分为基于 n shuffle 边界的“STAGE”!!!
  • Each stage is further divided to tasks based on the number of partitions on the RDD.每个阶段根据 RDD 上的分区数量进一步划分为任务。 So Task is smallest unit of work for spark.所以Task是spark的最小工作单元。
  • Now, how many of these tasks can be executed simultaneously depends on the "Number of Cores" available!!!现在,可以同时执行多少个这些任务取决于可用的“核心数”!!!
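The last two points can be sketched as a small back-of-the-envelope calculation (plain Python, not Spark code; the function name is made up for illustration): each partition becomes one task, and the cores cap how many tasks run at once, so a stage executes in "waves".

```python
import math

def schedule_waves(num_partitions: int, total_cores: int) -> int:
    """Each partition of a stage becomes one task; at most `total_cores`
    tasks run at the same time, so the stage completes in
    ceil(num_partitions / total_cores) sequential waves of tasks."""
    return math.ceil(num_partitions / total_cores)

# A stage with 10 partitions on a cluster with 4 total cores needs 3 waves:
# wave 1: tasks 1-4, wave 2: tasks 5-8, wave 3: tasks 9-10.
print(schedule_waves(10, 4))  # -> 3
```

This is why adding cores beyond the number of partitions buys nothing: once `total_cores >= num_partitions`, every task runs in a single wave and the extra cores sit idle for that stage.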

A partition (or task) refers to a unit of work. If you have a 200 GB Hadoop file loaded as an RDD and chunked into 128 MB blocks (the Spark default), then this RDD has roughly 1600 partitions. The number of cores determines how many of those partitions can be processed at any one time; up to 1600 tasks (capped at the number of partitions) can execute in parallel for this RDD.
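Working through that example with concrete numbers (a sketch in plain Python; the 16-core figure is an assumption chosen purely for illustration):

```python
import math

# Figures from the example above: a 200 GB file split into
# 128 MB blocks (the default HDFS/Spark block size).
file_size_mb = 200 * 1024        # 204,800 MB
block_size_mb = 128

# One partition (and therefore one task) per block.
num_partitions = math.ceil(file_size_mb / block_size_mb)
print(num_partitions)            # -> 1600 partitions

# Assume, say, 16 cores across the whole cluster: only 16 of those
# 1600 tasks run at any one moment, so the stage takes 100 waves.
total_cores = 16
print(math.ceil(num_partitions / total_cores))  # -> 100
```

So the partition count fixes the total number of tasks, while the core count only fixes how many of them run concurrently; the two settings are independent knobs.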

