
Spark - how to write ~20TB of data from a DataFrame to a hive table or hdfs?


I am processing 20TB+ of data with Spark. I am trying to write the data to a Hive table, using the following:

df.registerTempTable('temporary_table')
sqlContext.sql("INSERT OVERWRITE TABLE my_table SELECT * FROM temporary_table")

where df is a Spark DataFrame. Unfortunately it does not have any date to partition on. When I ran the code above, I got this error message:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 95561 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1844)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1857)
 at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
 at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply$mcI$sp(python.scala:126)
 at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
 at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
 at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
 at org.apache.spark.sql.execution.EvaluatePython$.takeAndServe(python.scala:124)
 at org.apache.spark.sql.execution.EvaluatePython.takeAndServe(python.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
 at py4j.Gateway.invoke(Gateway.java:259)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:209)
 at java.lang.Thread.run(Thread.java:745)

The error message also seems to depend on the volume of the data. For a smaller amount of data, I got the following error message instead:

Map output statuses were 395624469 bytes which exceeds spark.akka.frameSize (134217728 bytes).

Is there a more practical way to achieve this (if the task is feasible at all)? I am using Spark 1.6.

Below are the configuration parameters when I spark-submit the job:

spark-submit --deploy-mode cluster --master yarn --executor-memory 20G --num-executors 500 --driver-memory 64g --driver-cores 8 --files 'my_script.py'

By the way, naively I imagined that when the write happens, Spark would write the data from the executors directly to hdfs. But the error messages seem to imply that there is some data transfer between the executors and the driver?

I only have a very shallow understanding of Spark, so please forgive the silly question!

Check the spark.driver.maxResultSize configuration and modify it as per your requirement; the default value is 1g.
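A minimal sketch of raising this limit when creating the context (the 4g figure is an assumption to illustrate the setting, not a tuned value; 0 removes the cap entirely, at the risk of a driver OOM):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# spark.driver.maxResultSize caps the total size of serialized task results
# returned to the driver; its 1g default matches the 1024.0 MB limit in the
# error above. The 4g value here is an assumed example, not a recommendation.
conf = SparkConf().set("spark.driver.maxResultSize", "4g")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

With this context, the INSERT OVERWRITE statement from the question can be run unchanged.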

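Equivalently, both limits from the error messages can be raised at submit time with --conf flags. A sketch assuming my_script.py is passed as the application itself; the sizes are illustrative assumptions:

spark-submit --deploy-mode cluster --master yarn \
  --executor-memory 20G --num-executors 500 \
  --driver-memory 64g --driver-cores 8 \
  --conf spark.driver.maxResultSize=4g \
  --conf spark.akka.frameSize=512 \
  my_script.py

In Spark 1.6, spark.akka.frameSize is given in MB, so 512 raises the 134217728-byte (128 MB) default reported by the second error message.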
