
SQL query in Spark/Scala: Size exceeds Integer.MAX_VALUE

I am trying to run a simple SQL query on S3 events using Spark. I am loading ~30GB of JSON files as follows:

val d2 = spark.read.json("s3n://myData/2017/02/01/1234");                 // load ~30GB of JSON from S3
d2.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK);        // cache, spilling to disk if needed
d2.registerTempTable("d2");                                               // expose as a SQL temp table

Then I am trying to write the result of my query to a file:

val users_count = sql("select count(distinct data.user_id) from d2");
users_count.write.format("com.databricks.spark.csv").option("header", "true").save("s3n://myfolder/UsersCount.csv");

But Spark throws the following exception:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:439)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:672)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Note that the same query works on smaller amounts of data. What is the problem here?

No Spark shuffle block can be larger than 2GB (Integer.MAX_VALUE bytes), so you need more, smaller partitions.

You should adjust spark.default.parallelism and spark.sql.shuffle.partitions (default 200) so that the number of partitions can accommodate your data without reaching the 2GB limit (you could aim for roughly 256MB per partition, so for 200GB you would get 800 partitions). Thousands of partitions are very common, so don't be afraid to repartition to 1000 as suggested.
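
As a rough sketch (assuming a SparkSession named spark, as in spark-shell; the partition count of 1000 is illustrative, not taken from this answer), that could look like:

// Illustrative sketch: raise the shuffle partition count and repartition the
// input so that no single shuffle block approaches the 2GB limit. Tune the
// value to your data size (e.g. ~256MB per partition).
spark.conf.set("spark.sql.shuffle.partitions", "1000")

val d2 = spark.read.json("s3n://myData/2017/02/01/1234").repartition(1000)
d2.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)
d2.createOrReplaceTempView("d2")   // Spark 2.x equivalent of registerTempTable

val users_count = spark.sql("select count(distinct data.user_id) from d2")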

FYI, you can check the number of partitions of an RDD with something like rdd.getNumPartitions (i.e., d2.rdd.getNumPartitions).

There's a JIRA issue tracking the effort to address the various 2GB limits (it has been open for a while now): https://issues.apache.org/jira/browse/SPARK-6235

See http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications/25 for more info on this error.

When I was using Spark core to process 200GB of data, I set --conf spark.default.parallelism=2000 and .repartition(100), but the error still appeared. Finally, I solved it with the following setting:

import org.apache.spark.SparkConf

val conf = new SparkConf()
         .setAppName(appName)
         .set("spark.rdd.compress", "true")   // compress serialized RDD partitions

See the description of spark.rdd.compress in the Spark configuration documentation (https://spark.apache.org/docs/latest/configuration.html).
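
For reference, a minimal sketch of the same setting applied when building a SparkSession (the application name "MyApp" is a placeholder):

import org.apache.spark.sql.SparkSession

// Sketch: enable RDD compression at session construction time.
val spark = SparkSession.builder()
  .appName("MyApp")                       // placeholder application name
  .config("spark.rdd.compress", "true")   // compress serialized RDD partitions
  .getOrCreate()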

I hope this helps.
