Hive SerDe problem with nested structures

Question

I am trying to load a large volume of JSON data with a nested structure into Hive using a JSON SerDe. Some of the field names in the nested structure start with $. I am mapping the Hive field names using SERDEPROPERTIES, but when I query the table, the field starting with $ comes back as null. I have tried different syntax, with no luck.

Sample JSON:

{
    "_id" : "319FFE15FF90",
    "SomeThing" : 
    {
            "$SomeField"     : 22,
            "AnotherField"   : 2112,
            "YetAnotherField":    1
    }
 . . . etc . . . .

Using a schema as follows:

create table testSample
( 
    `_id` string, 
    something struct
    <
        $somefield:int,
        anotherfield:bigint, 
        yetanotherfield:int
    >
) 
row format serde 'org.openx.data.jsonserde.JsonSerDe' 
with serdeproperties
(
    "mapping.somefield" = "$somefield"
);

This schema builds OK. However, somefield (the one starting with $) in the above table always returns null (all the other values exist and are correct).

We've been trying a lot of syntax combinations, but to no avail.

Does anyone know the trick to map a nested field with a leading $ in its name?

Answer 1

You almost got it right. The mistake is in how the mapping and the column definition fit together. The SerDe property "mapping.somefield" = "$somefield" says "when looking for the Hive column named 'somefield', look for the JSON field '$somefield'". But in Hive you defined the column with the dollar sign, so there is no column named 'somefield' for the mapping to apply to. A dollar sign in a column name, if not outright illegal, is certainly not best practice in Hive. Try creating the table like this:

create table testSample
(
    `_id` string,
    something struct
    <
        somefield:int,
        anotherfield:bigint,
        yetanotherfield:int
    >
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
    "mapping.somefield" = "$somefield"
);

I tested it with some test data:

{ "_id" : "123", "something": { "$somefield": 12, "anotherfield":13,"yetanotherfield":100}}
hive> select something.somefield from testSample;
OK
12
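
For anyone who wants to reproduce the test above, here is a minimal sketch of the steps the answer leaves out. It assumes the OpenX JSON SerDe jar is not already on Hive's classpath and that the test row has been saved to a local file with one JSON object per line. Both paths below are hypothetical placeholders, so adjust them to your environment.

```sql
-- Hypothetical path: point this at wherever the OpenX JSON SerDe jar lives.
ADD JAR /path/to/json-serde-jar-with-dependencies.jar;

-- Hypothetical path: a local file holding the test row, one JSON object per line.
LOAD DATA LOCAL INPATH '/tmp/testSample.json' INTO TABLE testSample;

-- somefield is the Hive-side name; the mapping points it at the JSON key "$somefield".
SELECT `_id`, something.somefield, something.anotherfield
FROM testSample;
```

If the mapping is picked up, something.somefield should come back with the value from the JSON rather than null, as in the output shown above.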

Answer 2

I have suddenly started seeing this problem as well, but with normal column names too (no special characters such as $).

I am populating an external table (Temp) from another internal table (Table2) and want the output of the Temp table in JSON format. I want the column names in camel case in the output JSON file, so I am also using SERDEPROPERTIES on the Temp table to specify the correct names. However, when I run SELECT * on the Temp table, it gives NULL values for the columns whose names have been used in the mapping.

I am running Hive 0.13. Here are the commands:

Create table command:

CREATE EXTERNAL TABLE Temp (
    data STRUCT<
        customerId:BIGINT, region:STRING, marketplaceId:INT, asin:ARRAY<STRING>>
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
    'mapping.customerid' = 'customerId',
    'mapping.marketplaceid' = 'marketplaceId'
) 
LOCATION '/output'; 

INSERT INTO TABLE Temp
    SELECT 
        named_struct ('customerId',customerId, 'region', region, 'marketplaceId', marketplaceId, 'asin', asin) 
    FROM Table2;

Select * from Temp:

{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1PZC"]}
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1C9G"]}

See how "customerid" and "marketplaceid" are null. The generated JSON file is:

{"data":{"region":"EU","asin":["B000FC1PZC"]}}
{"data":{"region":"EU","asin":["B000FC1C9G"]}}

Now, if I remove the WITH SERDEPROPERTIES clause, the table starts getting all values:

{"customerid":1,"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"]}
{"customerid":2,"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"]}

And then the JSON file so generated is:

{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"],"customerid":1}}
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"],"customerid":2}}

Answer 1 (2 votes), posted 2015-12-30 19:52:49

Answer 2 (0 votes), posted 2016-01-16 00:44:38