Spark Streaming Job OOM when I increase resources

I've got a 4 node Spark Standalone cluster with a spark streaming job running on it.

When I submit the job with 7 cores per executor everything runs smoothly:

spark-submit --class com.test.StreamingJob --supervise --master spark://{SPARK_MASTER_IP}:7077 --executor-memory 30G --executor-cores 7 --total-executor-cores 28 /path/to/jar/spark-job.jar

When I increase to 24 cores per executor, none of the batches get processed and I see java.lang.OutOfMemoryError: unable to create new native thread in the executor logs. The executors then keep failing:

spark-submit --class com.test.StreamingJob --supervise --master spark://{SPARK_MASTER_IP}:7077 --executor-memory 30G --executor-cores 24 --total-executor-cores 96 /path/to/jar/spark-job.jar

Error:

17/01/12 16:01:00 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Shutdown-checker,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:534)
        at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:146)
        at io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:69)
        at com.datastax.driver.core.NettyOptions.onClusterClose(NettyOptions.java:190)
        at com.datastax.driver.core.Connection$Factory.shutdown(Connection.java:844)
        at com.datastax.driver.core.Cluster$Manager$ClusterCloseFuture$1.run(Cluster.java:2488)

I found this question and tried upping the ulimits substantially but it had no effect.
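For reference, the limits relevant to this error can be checked on each worker roughly as follows; this is only a sketch, and the user name and values are placeholders rather than recommendations. "unable to create new native thread" generally means the OS refused to create a thread, which is governed by the per-user process limit (nproc) and available native memory rather than the JVM heap.

# Run as the OS user that launches the Spark executors (the user name "spark" below is an assumption)
ulimit -u    # max user processes; native threads count against this limit
ulimit -n    # max open file descriptors

# To raise the limits persistently, entries like these can go in /etc/security/limits.conf
# (illustrative values only):
#   spark  soft  nproc   65536
#   spark  hard  nproc   65536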

Each box has 32 cores and 61.8 GB memory. The streaming job is written in Java, runs on Spark 2.0.0, and connects to Cassandra 3.7.0 with the spark-cassandra-connector-java_2.10 1.5.0-M2.

The data is a very small trickle of less than 100 events per second, each of which is less than 200 B.

Sounds like you are running Out of Memory ;).

For a little more detail, the number of cores in use by Spark is directly tied to the amount of information being worked on in parallel. You can basically think of each Core as working on a full Spark Partition's data, which can potentially require the whole partition to reside in memory.

7 Cores per executor means 7 Spark Partitions are being worked on simultaneously. Bumping this number up to 24 means roughly 4 times as much RAM will be in use. This could easily cause an OOM in various places.

There are a few ways to deal with this; a sketch combining a couple of these follows the list below.

  1. Allocate more memory to the Executor JVMs
  2. Shrink the size of the Spark Partitions (smaller partitions mean less data in memory at any given time)
  3. Make sure you aren't caching any RDDs in memory (and thus exhausting the system resources)
  4. Reduce the amount of data you are working with: take subsets or try to filter at the server before hitting Spark.
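As a rough, hedged illustration of how a resubmission might combine these ideas: lowering --executor-cores back toward the working configuration increases the memory available per concurrently running task, and (if the job uses a receiver-based input stream) a smaller spark.streaming.blockInterval than the 200ms default produces more, smaller blocks per batch, i.e. smaller partitions. The numbers below are placeholders to show the shape of the change, not tuned values:

spark-submit --class com.test.StreamingJob --supervise --master spark://{SPARK_MASTER_IP}:7077 --executor-memory 30G --executor-cores 8 --total-executor-cores 32 --conf spark.streaming.blockInterval=100ms /path/to/jar/spark-job.jar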
