Spark Hive Context - Avro table with partitions and uppercase field names

Question

For partitioned Avro Hive tables, fields whose names contain uppercase characters in the Avro schema come back as null when queried through Spark. I was wondering if there is some setting or workaround I am missing, or if this is just a bug in the Hive Context.

I've already tried adding the following to the DDL:

 WITH SERDEPROPERTIES ('casesensitive'='FieldName')

... and setting spark.sql.caseSensitive to true and to false.

Spark version 1.5.0, Hive version 1.1.0

You can recreate the issue by running the following DDL in Hive:

-- Hive DDL using partitions
CREATE TABLE avro_partitions (Field string)
PARTITIONED BY (part string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'=
  '{ "type":"record", "name":"avro_partitions", "namespace":"default", "fields":[ {"name":"Field", "type":"string"} ] }');
INSERT INTO avro_partitions PARTITION (part='01') VALUES('test');

-- Hive DDL without partitions
CREATE TABLE avro_no_partitions (Field string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'=
  '{ "type":"record", "name":"avro_no_partitions", "namespace":"default", "fields":[ {"name":"Field", "type":"string"} ] }');
INSERT INTO avro_no_partitions VALUES('test');

... and then selecting from the tables using Spark SQL (spark-shell):

sqlContext.sql("select * from default.avro_partitions").show
+-----+----+
|field|part|
+-----+----+
| null|  01|
+-----+----+

sqlContext.sql("select * from default.avro_no_partitions").show
+-----+
|field|
+-----+
| test|
+-----+

Answer 1

The issue is specifying avro.schema.literal in TBLPROPERTIES. It should be specified in SERDEPROPERTIES instead:

CREATE TABLE avro_partitions (Field string)
PARTITIONED BY (part string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type":"record", "name":"avro_partitions", "namespace":"default", "fields":[ {"name":"Field", "type":"string"} ] }')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
INSERT INTO avro_partitions PARTITION (part='01') VALUES('test');

Spark version 1.6.0
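
If the table already exists and dropping it is inconvenient, a possible alternative is to move the schema literal into the SerDe properties with ALTER TABLE. This is an untested sketch, not something the answer above verified. It assumes the `avro_partitions` table and the `part='01'` partition from the question. Hive keeps SerDe properties per partition, so partitions created before the change may need the same statement applied individually.

```sql
-- Set the Avro schema as a SerDe property on the existing table.
ALTER TABLE avro_partitions SET SERDEPROPERTIES (
  'avro.schema.literal'='{ "type":"record", "name":"avro_partitions", "namespace":"default", "fields":[ {"name":"Field", "type":"string"} ] }');

-- Repeat for each partition that already exists (assumption: these
-- do not pick up the table-level change automatically).
ALTER TABLE avro_partitions PARTITION (part='01') SET SERDEPROPERTIES (
  'avro.schema.literal'='{ "type":"record", "name":"avro_partitions", "namespace":"default", "fields":[ {"name":"Field", "type":"string"} ] }');

-- Confirm where the property now lives.
DESCRIBE FORMATTED avro_partitions PARTITION (part='01');
```

After the change, rerunning the `select * from default.avro_partitions` query in spark-shell shows whether the value is returned instead of null.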
