Having Spark process partitions concurrently, using a single dev/test machine

I'm naively testing for concurrency in local mode, with the following Spark context:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession
      .builder
      .appName("local-mode-spark")
      .master("local[*]")
      .config("spark.executor.instances", 4)
      .config("spark.executor.cores", 2)
      .config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
      .config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
      .getOrCreate()

and a mapPartitions API call as follows:

    import spark.implicits._

    val inputDF: DataFrame = spark.read.parquet(inputFile)

    val resultDF: DataFrame =
      inputDF.as[T].mapPartitions(sparkIterator => new MyIterator(sparkIterator)).toDF

On the surface of it, this did surface one concurrency bug in my code contained in MyIterator (not a bug in Spark's code). However, I'd like my application to crunch all available machine resources both in production and during this testing, to improve the chances of spotting additional concurrency bugs.

That is clearly not the case for me so far: my machine sits at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.

How would you recommend testing for concurrency on a local machine? The objective is to verify that in production, Spark will not bump into thread-safety or other concurrency issues in the code it applies from within MyIterator.

Can Spark, even in local mode, process separate partitions of my input dataframe in parallel? Can I get Spark to work concurrently on the same dataframe on a single machine, preferably in local mode?

  1. Max parallelism

You are already running Spark in local mode using .master("local[*]").

local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to determine that number).
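For a quick sanity check (a minimal sketch, assuming the spark session from your snippet is in scope, e.g. in spark-shell):

    // Number of threads local[*] will use on this machine
    println(Runtime.getRuntime.availableProcessors()) // e.g. 8

    // Spark's resulting default parallelism in local mode
    println(spark.sparkContext.defaultParallelism)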

  2. Max memory available to all executors/threads

I see that you are not setting the driver memory explicitly. By default the driver memory is 512M (1g in more recent Spark versions). If your local machine can spare more than this, set it explicitly. You can do that by either:

  1. setting it in the properties file (default is spark-defaults.conf):

     spark.driver.memory 5g 
  2. or by supplying the configuration setting at runtime:

     $ ./bin/spark-shell --driver-memory 5g 

Note that this cannot be achieved by setting it in the application, because by then it is already too late: the process has already started with some fixed amount of memory.
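You can verify what the driver JVM actually got (a small sketch; run it inside your application or spark-shell):

    // Max heap the driver JVM was started with; if this stays around
    // the default despite a larger spark.driver.memory set in the app,
    // the setting came too late
    println(Runtime.getRuntime.maxMemory() / (1024 * 1024) + " MB")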

  3. Nature of the job

Check the number of partitions in your dataframe. That will essentially determine the maximum parallelism you can use.

inputDF.rdd.partitions.size 

If the output is 1, your dataframe has only one partition, so you won't get concurrency when operating on it. In that case, you may have to tweak the configuration to create more partitions so that tasks can run concurrently.
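For example (a minimal sketch; the count of 8 is just an illustrative value to match the cores on a typical dev machine):

    // Split a single-partition dataframe so each local[*] thread
    // can work on its own slice; match the count to your core count
    val parallelDF = inputDF.repartition(8)
    println(parallelDF.rdd.partitions.size) // now 8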

Running in local mode cannot simulate a production environment, for the following reasons.

  1. A lot of code that would normally run under any other cluster manager gets bypassed when running in local mode. Among various issues, a few things that come to mind:
    a. Inability to detect bugs in the way shuffles get handled (shuffle data is handled in a completely different way in local mode).
    b. We will not be able to detect serialization-related issues, since all code is available to the driver and tasks run in the driver itself, so serialization issues never surface. (A local-cluster sketch that does exercise serialization follows the tl;dr below.)
    c. No speculative tasks (especially for write operations).
    d. Networking-related issues: all tasks are executed in the same JVM, so one cannot detect issues like driver/executor communication or codegen-related problems.
  2. Concurrency in local mode
    a. The maximum concurrency that can be attained equals the number of cores in your local machine. (Link to code)
    b. The Job, Stage, and Task metrics shown in the Spark UI are not accurate, since they incur the overhead of running in the same JVM where the driver is also running.
    c. As for CPU/memory utilization, it depends on the operation being performed. Is the operation CPU- or memory-intensive?
  3. When to use local mode
    a. Testing code that will run only on the driver
    b. Basic sanity testing of the code that will get executed on the executors
    c. Unit testing

tl;dr The concurrency bugs that occur in local mode might not even be present under other cluster resource managers, since there is a lot of special handling in Spark's code for local mode (plenty of code checks isLocal and control goes down a different code path altogether).
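If you want something closer to a real cluster on a single machine, one option worth trying (an assumption on my part rather than part of the answer above; this master URL is mainly used by Spark's own test suites and requires a full Spark distribution with SPARK_HOME set) is local-cluster mode, which spawns separate executor JVMs so tasks are actually serialized and shipped:

    import org.apache.spark.sql.SparkSession

    // local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
    // starts real executor JVMs on this machine, unlike plain local[*],
    // so serialization and shuffle code paths are exercised.
    // A testing aid only; not meant for production.
    val spark = SparkSession
      .builder
      .appName("local-cluster-test")
      .master("local-cluster[2, 2, 1024]")
      .getOrCreate()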

Yes! Achieving parallelism in local mode is quite possible. Check the amount of memory and CPU available on your local machine, and supply values to the driver-memory and driver-cores conf when submitting your Spark job.

Increasing executor-memory and executor-cores will not make a difference in this mode.

Once the application is running, open up the Spark UI for the job (typically at http://localhost:4040 in local mode). You can then go to the EXECUTORS tab to check the amount of resources your Spark job is actually utilizing.

You can monitor the various tasks that get generated, and the number of tasks your job runs concurrently, using the JOBS and STAGES tabs.

In order to process data far larger than the available resources, ensure that you break your data into smaller partitions using repartition. This should allow your job to complete successfully.

Increase the default shuffle partitions in case your job has aggregations or joins. Also, ensure sufficient space on the local file system, since Spark creates intermediate shuffle files and writes them to disk.
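For example (a sketch; 400 is only an illustrative value, Spark's default is 200):

    // Raise the parallelism of shuffles produced by joins/aggregations
    spark.conf.set("spark.sql.shuffle.partitions", 400)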

Hope this helps!
