Spark & HBase: java.io.IOException: Connection reset by peer

I would appreciate it if you could help me.

While implementing Spark Streaming from Kafka to HBase (code is attached; a rough sketch is shown below), we ran into "java.io.IOException: Connection reset by peer" (full log is attached).
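For reference, the job follows roughly this pattern — a minimal sketch, not the exact attached code; broker, topic, table, and column names are placeholders, and the exact HBaseContext API can vary between hbase-spark versions. It uses the spark-streaming-kafka-0-10 integration and the hbase-spark module:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hbase"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",             // placeholder broker list
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "hbase-writer")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // hbase-spark opens HBase connections inside the executors and writes
    // each record as a Put; this sketch assumes non-null Kafka keys
    val hbaseContext = new HBaseContext(ssc.sparkContext, HBaseConfiguration.create())
    hbaseContext.streamBulkPut[(String, String)](
      stream.map(r => (r.key, r.value)),
      TableName.valueOf("events"),
      { case (k, v) =>
        new Put(Bytes.toBytes(k))
          .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(v))
      })

    ssc.start()
    ssc.awaitTermination()
  }
}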

The issue only appears when we write to HBase with the dynamic allocation option enabled in the Spark settings. If we write the data to HDFS (a Hive table) instead of HBase, or if dynamic allocation is disabled, no errors occur.

We have tried changing the ZooKeeper connections, the Spark executor idle timeout, and the network timeout. We have also tried switching the shuffle block transfer service (NIO), but the error is still there. If we cap the min/max executor count for dynamic allocation below 80, there are no problems either.

What might the problem be? There are many almost identical problems on Jira and Stack Overflow, but nothing has helped.

Versions:

HBase 1.2.0-cdh5.14.0
Kafka  3.0.0-1.3.0.0.p0.40
SPARK2 2.2.0.cloudera2-1.cdh5.12.0.p0.232957
hbase-client/hbase-spark(org.apache.hbase) 1.2.0-cdh5.11.1

Spark settings:

--num-executors=80
--conf spark.sql.shuffle.partitions=200
--conf spark.driver.memory=32g
--conf spark.executor.memory=32g
--conf spark.executor.cores=4

Cluster: 1+8 nodes, 70 CPUs, 755 GB RAM, 10x HDD.

Log:

18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 717 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 717 successfully in removeExecutor
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 717 has been removed (new total is 26)
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 705.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 705 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 705 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 705 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(705, lang32.ca.sbrf.ru, 22805, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 705 has been removed (new total is 25)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 705 successfully in removeExecutor
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 716.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 716 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 716 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 716 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(716, lang32.ca.sbrf.ru, 28678, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 716 has been removed (new total is 24)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 716 successfully in removeExecutor
18/04/09 13:51:56 WARN server.TransportChannelHandler: Exception in connection from /10.116.173.65:57542
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
18/04/09 13:51:56 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.116.173.65:57542 is closed
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 548.

Please see my related answer here: What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark

It also took me a while to understand why Cloudera states the following:

Dynamic allocation and Spark Streaming

If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.

Reference: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_dynamic_allocation_streaming
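On the submit command line this corresponds to (same flag style as the settings above):

--conf spark.dynamicAllocation.enabled=false

With dynamic allocation off, the fixed --num-executors value takes effect instead.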

Try setting these two parameters; example values follow below. Also try caching the DataFrame before writing to HBase.

spark.network.timeout

spark.executor.heartbeatInterval
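For example, submit-time values in the same style as above (illustrative starting points, not tuned recommendations; keep spark.network.timeout comfortably larger than spark.executor.heartbeatInterval):

--conf spark.network.timeout=800s
--conf spark.executor.heartbeatInterval=60s

And a minimal sketch of caching before the HBase write, so tasks retried after an executor loss re-read cached blocks instead of recomputing the whole lineage (df and the write step are placeholders for the actual job):

import org.apache.spark.storage.StorageLevel

val cached = df.persist(StorageLevel.MEMORY_AND_DISK)  // df: the DataFrame about to be written
cached.count()       // materialize the cache before the write
// ... write `cached` to HBase here ...
cached.unpersist()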
