Memory issue when importing parquet files in Spark

I am trying to query data from parquet files in Scala Spark (1.5), including a query of 2 million rows ("variants" in the following code).

val sqlContext = new org.apache.spark.sql.SQLContext(sc)  
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")

val parquetFile = sqlContext.read.parquet(<path>)

parquetFile.registerTempTable("tmpTable")
sqlContext.cacheTable("tmpTable")

val patients = sqlContext.sql("SELECT DISTINCT patient FROM tmpTable ...")

val variants = sqlContext.sql("SELECT DISTINCT ... FROM tmpTable ...")

This runs fine when the number of rows fetched is low, but fails with a "Size exceeds Integer.MAX_VALUE" error when lots of data is requested. The error looks as follows:

User class threw exception: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 43 in stage 1.0 failed 4 times,
most recent failure: Lost task 43.3 in stage 1.0 (TID 123, node009):
java.lang.RuntimeException: java.lang.IllegalArgumentException:
Size exceeds Integer.MAX_VALUE at
sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) at
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) at ...

What can I do to make this work?

This looks like a memory issue, but I have tried using up to 100 executors with no difference (the time it takes to fail stays the same regardless of how many executors are involved). It feels like the data isn't getting partitioned across the nodes?

I have attempted to force higher parallelization by naively replacing this line, to no avail:

val variants = sqlContext.sql("SELECT DISTINCT ... FROM tmpTable ...").repartition(sc.defaultParallelism * 10)

I don't believe the issue is Parquet-specific. You are hitting a limit on the maximum size of a single partition in Spark.

Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at ...

The Integer.MAX_VALUE in the message indicates that a single partition is (I believe) larger than 2 GB, i.e. its size can no longer be indexed by an int32.
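
A quick way to check whether that is plausible is to look at how many partitions actually back the DataFrame before it is cached. This is only a minimal diagnostic sketch, assuming parquetFile is the DataFrame from the question:

// Count the partitions backing the DataFrame. If the whole table is read
// into only a few partitions, a single cached block can exceed 2 GB
// (Integer.MAX_VALUE bytes), which triggers the error above.
val numPartitions = parquetFile.rdd.partitions.length
println(s"parquetFile is backed by $numPartitions partitions")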

The comment from Joe Widen is spot on. You need to repartition your data even more: try 1000 partitions or more. For example:

val data = sqlContext.read.parquet("data.parquet").repartition(1000)
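
Since the DISTINCT queries go through Spark SQL's shuffle, the number of partitions they produce is governed by spark.sql.shuffle.partitions (default 200) rather than by the input DataFrame, so raising that setting may help as well. A minimal sketch; the value 1000 is just an illustration, not a tuned number, and the "..." placeholders stand for the query from the question:

// Make the shuffle behind SELECT DISTINCT produce more, smaller partitions,
// so no single partition grows past the 2 GB block limit.
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")
val variants = sqlContext.sql("SELECT DISTINCT ... FROM tmpTable ...")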
