
spark error: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

I try to count the number of negative samples as follows:

val numNegatives = dataSet.filter(col("label") < 0.5).count

But I get a "Size exceeds Integer.MAX_VALUE" error:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1239)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:512)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:636)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Some answers suggest increasing the number of partitions, so I updated the code above as follows:

val data = dataSet.repartition(5000).cache()
val numNegatives = data.filter(col("label") < 0.5).count

But it reports the same error! This has puzzled me for several days. Can anyone help me? Thanks.

Try repartitioning before the filter:

val numNegatives = dataSet.repartition(1000).filter(col("label") < 0.5).count

The filter is executed on the original DataSet's partitions, and only the result is repartitioned afterwards. You need the smaller partitions in place before the filter runs.
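
As a quick sanity check, here is a minimal sketch (assuming `dataSet` is the DataFrame from the question and a SparkSession is already in scope) that compares the partition count before and after the explicit repartition, so you can confirm the filter really runs on the smaller partitions:

import org.apache.spark.sql.functions.col

// Number of partitions the filter would otherwise run on
println(s"original partitions: ${dataSet.rdd.getNumPartitions}")

// Repartition first, so the filter itself works on smaller blocks
val repartitioned = dataSet.repartition(1000)
println(s"after repartition:   ${repartitioned.rdd.getNumPartitions}")

val numNegatives = repartitioned.filter(col("label") < 0.5).count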

The problem here is that the size of a ShuffleRDD block is larger than 2 GB after materialization. Spark has this limitation. You need to change the spark.sql.shuffle.partitions parameter, which defaults to 200.

In addition, you may need to increase the number of partitions your dataset has. Repartition and save it first, then read the new dataset back and perform the operation.

spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.parquet("/path/to/hdfs")
val newDataset = spark.read.parquet("/path/to/hdfs")  
newDataset.filter(...).count

Or, if you want to use a Hive table:

spark.sql("SET spark.sql.shuffle.partitions = 10000")
dataset.repartition(10000).write.saveAsTable("newTableName")
val newDataset = spark.table("newTableName")  
newDataset.filter(...).count
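
As a side note (not part of the answer above), the same shuffle setting can also be supplied outside a SQL SET statement; a minimal sketch, assuming a Spark 2.x SparkSession named spark as in the examples above:

// At submit time:
//   spark-submit --conf spark.sql.shuffle.partitions=10000 ...

// Or programmatically on the session, before the shuffle-producing action runs:
spark.conf.set("spark.sql.shuffle.partitions", "10000")
println(spark.conf.get("spark.sql.shuffle.partitions"))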
