Memory allocation issue in writing Spark DataFrame to Hive table

I am trying to save a Spark DataFrame to a Hive table (Parquet) with .saveAsTable() in pySpark, but keep running into memory issues like the one below:

org.apache.hadoop.hive.ql.metadata.HiveException: parquet.hadoop.MemoryManager$1:
New Memory allocation 1034931 bytes is smaller than the minimum allocation size of 1048576 bytes.

The first number (1034931) generally keeps changing across runs. I recognize the second number (1048576) is 1024^2 (i.e. 1 MiB), but I have little idea what that means here.

I have been using the exact same technique in a few of my other projects (with much larger DataFrames), and it has worked without issue. Here I have essentially copy-pasted the structure of the process and configuration, but it runs into the memory problem! It must be something trivial I am missing.

The Spark DataFrame (let's call it sdf) has the following structure (~10 columns and ~300k rows, but it could be more if this runs correctly):

+----------+----------+----------+---------------+---------------+
| col_a_str| col_b_num| col_c_num|partition_d_str|partition_e_str|
+----------+----------+----------+---------------+---------------+
|val_a1_str|val_b1_num|val_c1_num|     val_d1_str|     val_e1_str|
|val_a2_str|val_b2_num|val_c2_num|     val_d2_str|     val_e2_str|
|       ...|       ...|       ...|            ...|            ...|
+----------+----------+----------+---------------+---------------+
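
For context, a toy DataFrame with the same shape can be put together like this (hypothetical values only; the real sdf is built from an RDD, as shown in the answer below):

from pyspark.sql import Row

# Hypothetical rows purely to illustrate the shape of sdf
toy_rows = [
    Row(col_a_str='a1', col_b_num=1.0, col_c_num=2.0,
        partition_d_str='d1', partition_e_str='e1'),
    Row(col_a_str='a2', col_b_num=3.0, col_c_num=4.0,
        partition_d_str='d2', partition_e_str='e2'),
]
toy_sdf = sqlContext.createDataFrame(toy_rows)
toy_sdf.show()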

The Hive table was created like this:

sqlContext.sql("""
                    CREATE TABLE IF NOT EXISTS my_hive_table (
                        col_a_str string,
                        col_b_num double,
                        col_c_num double
                    ) 
                    PARTITIONED BY (partition_d_str string,
                                    partition_e_str string)
                    STORED AS PARQUETFILE
               """)

The attempt at inserting data into this table uses the following command:

sdf.write \
   .mode('append') \
   .partitionBy('partition_d_str', 'partition_e_str') \
   .saveAsTable('my_hive_table')

The Spark/Hive configuration is like this:

spark_conf = pyspark.SparkConf()
spark_conf.setAppName('my_project')

spark_conf.set('spark.executor.memory', '16g')                  # heap per executor
spark_conf.set('spark.python.worker.memory', '8g')              # memory per Python worker before spilling
spark_conf.set('spark.yarn.executor.memoryOverhead', '15000')   # off-heap overhead per executor, in MB
spark_conf.set('spark.dynamicAllocation.maxExecutors', '64')    # cap for dynamic allocation
spark_conf.set('spark.executor.cores', '4')                     # cores per executor

sc = pyspark.SparkContext(conf=spark_conf)

sqlContext = pyspark.sql.HiveContext(sc)
sqlContext.setConf('hive.exec.dynamic.partition', 'true')            # enable dynamic partition inserts
sqlContext.setConf('hive.exec.max.dynamic.partitions', '5000')       # raise the dynamic-partition cap
sqlContext.setConf('hive.exec.dynamic.partition.mode', 'nonstrict')  # allow all partition columns to be dynamic
sqlContext.setConf('hive.exec.compress.output', 'true')              # compress written output

I have tried changing .partitionBy('partition_d_str', 'partition_e_str') to .partitionBy(['partition_d_str', 'partition_e_str']), increasing memory, splitting the DataFrame into smaller chunks, and re-creating the tables and the DataFrame, but nothing seems to work. I can't find any solutions online either. What could be causing the memory error (I don't fully understand where it's coming from either), and how can I change my code to write to the Hive table? Thanks.

It turns out I was partitioning with a nullable field that was throwing .saveAsTable() off. When I was converting the RDD to a Spark DataFrame, the schema I was providing was generated like this:

from pyspark.sql.types import *

# Define schema
my_schema = StructType(
                    [StructField('col_a_str', StringType(), False),
                     StructField('col_b_num', DoubleType(), True),
                     StructField('col_c_num', DoubleType(), True),
                     StructField('partition_d_str', StringType(), False),
                     StructField('partition_e_str', StringType(), True)])

# Convert RDD to Spark DataFrame
sdf = sqlContext.createDataFrame(my_rdd, schema=my_schema)
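
A quick way to see the resulting nullability flags (not shown in the original answer) is printSchema(), which prints a nullable marker for every field:

sdf.printSchema()
# Expected output shape with the schema above:
# root
#  |-- col_a_str: string (nullable = false)
#  |-- col_b_num: double (nullable = true)
#  |-- col_c_num: double (nullable = true)
#  |-- partition_d_str: string (nullable = false)
#  |-- partition_e_str: string (nullable = true)   <-- the problematic partition column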

Since partition_e_str was declared with nullable=True (the third argument of that StructField), it caused issues when writing to the Hive table because it was being used as one of the partitioning fields. I changed it to:

# Define schema
my_schema = StructType(
                    [StructField('col_a_str', StringType(), False),
                     StructField('col_b_num', DoubleType(), True),
                     StructField('col_c_num', DoubleType(), True),
                     StructField('partition_d_str', StringType(), False),
                     StructField('partition_e_str', StringType(), False)])

and all was well again!

Lesson: Make sure your partitioning fields are not nullable!
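
As a follow-up sketch (my own addition, not part of the original fix): if the incoming data may genuinely contain nulls in the partition columns, they can also be handled at the data level before writing, instead of (or in addition to) tightening the schema. The partition_cols list and the 'UNKNOWN' sentinel below are hypothetical choices.

partition_cols = ['partition_d_str', 'partition_e_str']

# Option 1: drop rows whose partition columns are null
sdf_clean = sdf.na.drop(subset=partition_cols)

# Option 2: replace nulls with a sentinel partition value instead of dropping rows
sdf_clean = sdf.fillna({c: 'UNKNOWN' for c in partition_cols})

sdf_clean.write \
    .mode('append') \
    .partitionBy(*partition_cols) \
    .saveAsTable('my_hive_table')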
