
Upgrading Spark 2.4.5 to Spark 3.3.2 Causing Shuffle Failures

We used to run our jobs on a Spark 2.4.5 standalone cluster. We upgraded the cluster to Spark 3.3.2 and, after upgrading the application code to Spark 3.3.2 as well, started running our jobs on the new cluster. Most jobs run fine, but some fail with shuffle errors. I searched for known issues but found no useful resources. I am confident this is not related to memory problems or executor/worker failures, which we normally resolve by adding resources.

From the exception, an executor fails to fetch shuffle files from a worker node. We are not sure whether the fetch is remote or local. Either way, not all jobs are failing; only some jobs fail, and they fail every day.
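One way to probe the local-vs-remote question: Spark 3 has a `spark.shuffle.readHostLocalDisk` property (default `true`) that lets executors on the same host read each other's shuffle files directly from disk instead of fetching them over the network. The `getHostLocalShuffleData` frames in the trace below suggest the failing fetch is one of these host-local disk reads, so disabling it is a cheap experiment (the master URL and jar name are placeholders, not our real values):

```shell
# Diagnostic sketch: force host-local shuffle blocks to be fetched over the
# network instead of being read directly from the peer executor's local disk.
# spark.shuffle.readHostLocalDisk is a standard Spark 3.x property (default: true).
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.shuffle.readHostLocalDisk=false \
  your-app.jar   # placeholder jar name
```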

Jobs that fail with the default configuration succeed when the External Shuffle Service is enabled. For scalability reasons, though, we want to keep the External Shuffle Service disabled and run the jobs with the default configuration.
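For reference, here is roughly what the two configurations look like on a standalone cluster (values and jar names are illustrative, not our exact setup):

```shell
# Workaround that works: run the External Shuffle Service inside each Worker.
# In standalone mode, set the property on every worker host before (re)starting
# the Worker, e.g. in conf/spark-env.sh:
#   SPARK_WORKER_OPTS="-Dspark.shuffle.service.enabled=true"
# and opt the application in at submit time:
spark-submit --conf spark.shuffle.service.enabled=true your-app.jar   # placeholder jar

# Desired (default) configuration, under which some jobs currently fail:
spark-submit --conf spark.shuffle.service.enabled=false your-app.jar  # placeholder jar
```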

Could someone please help us investigate this issue? Thanks. Below is the exception we see in the failing jobs.

"2022-07-14T15:21:26.781+0000" [WARN] {"logger":"scheduler.TaskSetManager", Lost task 3.0 in stage 40.1 (TID 82) (10.194.39.216 executor 11): FetchFailed(BlockManagerId(16, 10.194.39.216, 37299, None), shuffleId=24, mapIndex=2, mapId=47, reduceId=4, message=
org.apache.spark.shuffle.FetchFailedException
    at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:312)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1166)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:904)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:50)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.file.NoSuchFileException: /tmp/spark-c3eebc32-7801-45e8-b1f1-62dbd729df98/executor-251a979e-f27d-413a-ae48-3153508c55be/blockmgr-532637ac-156e-4e7e-9934-6e15bdaf9ed9/2f/shuffle_24_47_0.index
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel(Files.java:361)
    at java.nio.file.Files.newByteChannel(Files.java:407)
    at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:582)
    at org.apache.spark.storage.BlockManager.getHostLocalShuffleData(BlockManager.scala:673)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlock(ShuffleBlockFetcherIterator.scala:591)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2(ShuffleBlockFetcherIterator.scala:673)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2$adapted(ShuffleBlockFetcherIterator.scala:672)
    at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
    at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
    at scala.collection.immutable.List.forall(List.scala:91)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1(ShuffleBlockFetcherIterator.scala:672)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1$adapted(ShuffleBlockFetcherIterator.scala:671)
    at scala.collection.Iterator.forall(Iterator.scala:955)
    at scala.collection.Iterator.forall$(Iterator.scala:953)
    at scala.collection.AbstractIterator.forall(Iterator.scala:1431)
    at scala.collection.IterableLike.forall(IterableLike.scala:77)
    at scala.collection.IterableLike.forall$(IterableLike.scala:76)
    at scala.collection.AbstractIterable.forall(Iterable.scala:56)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchMultipleHostLocalBlocks(ShuffleBlockFetcherIterator.scala:671)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchHostLocalBlocks$6(ShuffleBlockFetcherIterator.scala:645)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchHostLocalBlocks$6$adapted(ShuffleBlockFetcherIterator.scala:640)
    at org.apache.spark.storage.HostLocalDirManager.$anonfun$getHostLocalDirs$1(BlockManager.scala:156)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.spark.network.shuffle.BlockStoreClient$1.onSuccess(BlockStoreClient.java:170)
    at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:196)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
)}

I saw the same errors you are seeing, and was able to fix them by adding "spark.shuffle.useOldFetchProtocol": "true" to the Spark configuration.
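A minimal sketch of applying that setting at submit time (the master URL and jar name are placeholders; `spark.shuffle.useOldFetchProtocol` is a standard Spark 3.x property that falls back to the pre-3.0 shuffle fetch protocol):

```shell
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.shuffle.useOldFetchProtocol=true \
  your-app.jar   # placeholder jar name
```

If I read the Spark configuration docs correctly, host-local disk reads (`spark.shuffle.readHostLocalDisk`) only apply while `spark.shuffle.useOldFetchProtocol` is disabled, so this setting likely also sidesteps the host-local read path that the `NoSuchFileException` in the question's stack trace points at.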
