Spark Streaming job gets killed after running for about 1 hour
I have a Spark Streaming job that reads a stream of tweets from Gnip and writes it to Kafka.
Spark and Kafka are running on the same cluster. My cluster consists of 5 nodes: kafka-b01 ... kafka-b05. The Spark master is running on kafka-b05.
Here is how we submit the Spark job:

nohup sh $SPZRK_HOME/bin/spark-submit --total-executor-cores 5 --class com.test.java.gnipStreaming.GnipSparkStreamer --master spark://kafka-b05:7077 GnipStreamContainer.jar powertrack kafka-b01,kafka-b02,kafka-b03,kafka-b04,kafka-b05 gnip_live_stream 2 &
After about 1 hour the Spark job gets killed. The logs in the nohup file show the following exception:
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 2 locations. Most recent failure cause:
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:48)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.ChannelException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel
at io.netty.bootstrap.AbstractBootstrap$BootstrapChannelFactory.newChannel(AbstractBootstrap.java:455)
at io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:306)
at io.netty.bootstrap.Bootstrap.doConnect(Bootstrap.java:134)
at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:116)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:211)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:99)
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
... 15 more
Caused by: io.netty.channel.ChannelException: Failed to open a socket.
at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:62)
at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:72)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at io.netty.bootstrap.AbstractBootstrap$BootstrapChannelFactory.newChannel(AbstractBootstrap.java:453)
... 26 more
Caused by: java.net.SocketException: Too many open files
at sun.nio.ch.Net.socket0(Native Method)
at sun.nio.ch.Net.socket(Net.java:411)
at sun.nio.ch.Net.socket(Net.java:404)
at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:60)
... 33 more
I have increased the maximum number of open files to 3275782 (the old limit was almost half this number), but I am still facing the same issue.
When I checked the stderr logs of the workers from the Spark web interface, I found another exception:
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:74)
at kafka.producer.SyncProducer.send(SyncProducer.scala:119)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:59)
at kafka.producer.BrokerPartitionInfo.updateInfo(BrokerPartitionInfo.scala:82)
at kafka.producer.BrokerPartitionInfo.getBrokerPartitionInfo(BrokerPartitionInfo.scala:49)
at kafka.producer.async.DefaultEventHandler.kafka$producer$async$DefaultEventHandler$$getPartitionListForTopic(DefaultEventHandler.scala:188)
at kafka.producer.async.DefaultEventHandler$$anonfun$partitionAndCollate$1.apply(DefaultEventHandler.scala:152)
at kafka.producer.async.DefaultEventHandler$$anonfun$partitionAndCollate$1.apply(DefaultEventHandler.scala:151)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at kafka.producer.async.DefaultEventHandler.partitionAndCollate(DefaultEventHandler.scala:151)
at kafka.producer.async.DefaultEventHandler.dispatchSerializedData(DefaultEventHandler.scala:96)
at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:73)
at kafka.producer.Producer.send(Producer.scala:77)
at kafka.javaapi.producer.Producer.send(Producer.scala:33)
at com.test.java.gnipStreaming.GnipSparkStreamer$1$1.call(GnipSparkStreamer.java:59)
at com.test.java.gnipStreaming.GnipSparkStreamer$1$1.call(GnipSparkStreamer.java:51)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:225)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The second exception, it seems, is related to Kafka, not Spark.
What do you think the problem is?
EDIT
Based on a comment from Yuval Itzchakov, here is the code of the streamer:
The main class: http://pastebin.com/EcbnQQ3a
The custom receiver class: http://pastebin.com/3UFPktKR
The problem is that you're instantiating a new instance of Producer on every iteration of DStream.foreachPartition. If you have a data-intensive stream, this causes a lot of producers to be allocated, each opening its own connections to Kafka: a new producer per partition per batch adds up to thousands of short-lived sockets over an hour, which lines up with the "Too many open files" error you're seeing.
The first thing I'd make sure is that you're properly closing the producer once you're done sending the data, using a finally block and calling producer.close:
public Void call(JavaRDD<String> rdd) throws Exception {
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
        @Override
        public void call(Iterator<String> itr) throws Exception {
            // Declare the producer outside the try block so it is in scope in finally
            Producer<String, String> producer = getProducer(hosts);
            try {
                while (itr.hasNext()) {
                    try {
                        KeyedMessage<String, String> message =
                            new KeyedMessage<String, String>(topic, itr.next());
                        producer.send(message);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            } finally {
                // Always release the producer's sockets, even if sending fails
                producer.close();
            }
        }
    });
    return null;
}
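To confirm that descriptors are actually leaking, you can also log the open-descriptor count of the executor process from inside a partition and watch whether it climbs toward the ulimit over time. This is a Linux-only sketch (it reads /proc/self/fd, which doesn't exist on other platforms); the class name is just for illustration:

```java
import java.io.File;

public class OpenFdCheck {
    public static void main(String[] args) {
        // On Linux, /proc/self/fd holds one entry per open file descriptor
        // of the current process (sockets included).
        String[] fds = new File("/proc/self/fd").list();
        if (fds != null) {
            System.out.println("open-fd check: " + fds.length);
        } else {
            // Not on Linux (or /proc is not mounted)
            System.out.println("open-fd check: unavailable");
        }
    }
}
```

If you print this once per batch and the number grows steadily, something is still not being closed.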
If that still doesn't work and you're still seeing too many connections, I'd create an object pool of Kafka producers that you can borrow from on demand. That way, you explicitly control the number of producers in use and the number of sockets you open.
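A minimal pool can be sketched with a BlockingQueue. Everything here (SimplePool, the Supplier factory, the Object stand-in for a producer) is illustrative scaffolding, not a library API; in your streamer you'd pass a factory that calls your getProducer(hosts) and pool real Producer instances:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ProducerPoolSketch {

    // Fixed-size, thread-safe object pool: all instances are created up
    // front, and borrow() blocks when the pool is exhausted.
    static class SimplePool<T> {
        private final BlockingQueue<T> pool;

        SimplePool(int size, Supplier<T> factory) {
            pool = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                pool.add(factory.get());
            }
        }

        T borrow() throws InterruptedException {
            return pool.take();
        }

        void release(T instance) {
            pool.offer(instance);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a Kafka producer factory; count how many get created.
        AtomicInteger created = new AtomicInteger();
        SimplePool<Object> pool = new SimplePool<>(2, () -> {
            created.incrementAndGet();
            return new Object();
        });

        // Borrow and release more times than the pool size:
        // no new instances (and hence no new sockets) are created.
        for (int i = 0; i < 6; i++) {
            Object producer = pool.borrow();
            pool.release(producer);
        }
        System.out.println("created=" + created.get());
    }
}
```

Inside foreachPartition you'd borrow() a producer at the start, send the partition's messages, and release() it in a finally block, so the number of live producers (and sockets) never exceeds the pool size.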