Spark Streaming Job OOM when I increase resources

I've got a 4 node Spark Standalone cluster with a spark streaming job running on it.

When I submit the job with 7 cores per executor everything runs smoothly:

spark-submit --class com.test.StreamingJob --supervise --master spark://{SPARK_MASTER_IP}:7077 --executor-memory 30G --executor-cores 7 --total-executor-cores 28 /path/to/jar/spark-job.jar

When I increase to 24 cores per executor, none of the batches get processed and I see java.lang.OutOfMemoryError: unable to create new native thread in the executor logs. The executors then keep failing:

spark-submit --class com.test.StreamingJob --supervise --master spark://{SPARK_MASTER_IP}:7077 --executor-memory 30G --executor-cores 24 --total-executor-cores 96 /path/to/jar/spark-job.jar

Error:

17/01/12 16:01:00 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Shutdown-checker,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:534)
        at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:146)
        at io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:69)
        at com.datastax.driver.core.NettyOptions.onClusterClose(NettyOptions.java:190)
        at com.datastax.driver.core.Connection$Factory.shutdown(Connection.java:844)
        at com.datastax.driver.core.Cluster$Manager$ClusterCloseFuture$1.run(Cluster.java:2488)

I found this question and tried upping the ulimits substantially but it had no effect.
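For reference, the limits relevant to this error can be checked on each worker roughly as follows; this is only a sketch, and the user name and values are placeholders rather than recommendations. "unable to create new native thread" generally means the OS refused to create a thread, which is governed by the per-user process limit (nproc) and available native memory rather than the JVM heap.

# Run as the OS user that launches the Spark executors (the user name "spark" below is an assumption)
ulimit -u    # max user processes; native threads count against this limit
ulimit -n    # max open file descriptors

# To raise the limits persistently, entries like these can go in /etc/security/limits.conf
# (illustrative values only):
#   spark  soft  nproc   65536
#   spark  hard  nproc   65536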

Each box has 32 cores and 61.8 GB memory. The streaming job is written in Java, runs on Spark 2.0.0, and connects to Cassandra 3.7.0 with the spark-cassandra-connector-java_2.10 1.5.0-M2.

The data is a very small trickle of less than 100 events per second, each of which is less than 200 B.

Sounds like you are running Out of Memory ;).

For a little more detail, the number of cores in use by Spark is directly tied to the amount of information being worked on in parallel. You can basically think of each Core as working on a full Spark Partition's data, which can potentially require the whole partition to reside in memory.

7 Cores per executor means 7 Spark Partitions are being worked on simultaneously. Bumping this number up to 24 means roughly 4 times as much RAM will be in use. This could easily cause an OOM in various places.

There are a few ways to deal with this; a sketch combining a couple of these follows the list below.

  1. Allocate more memory to the Executor JVMs
  2. Shrink the size of the Spark Partitions (smaller partitions mean less data in memory at any given time)
  3. Make sure you aren't caching any RDDs in memory (and thus exhausting the system resources)
  4. Reduce the amount of data you are working with: take subsets or try to filter at the server before hitting Spark.
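As a rough, hedged illustration of how a resubmission might combine these ideas: lowering --executor-cores back toward the working configuration increases the memory available per concurrently running task, and (if the job uses a receiver-based input stream) a smaller spark.streaming.blockInterval than the 200ms default produces more, smaller blocks per batch, i.e. smaller partitions. The numbers below are placeholders to show the shape of the change, not tuned values:

spark-submit --class com.test.StreamingJob --supervise --master spark://{SPARK_MASTER_IP}:7077 --executor-memory 30G --executor-cores 8 --total-executor-cores 32 --conf spark.streaming.blockInterval=100ms /path/to/jar/spark-job.jar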
