
How do I create a schema for Hive to parse deeply nested json (Azure Application Insights output) using SerDe?

I'm trying to create a Hive schema to parse JSON; however, I'm having trouble writing the schema when the JSON document has the following structure:

{
    "context": {        
        "custom": {
            "dimensions": [{
                "action": "GetFilters"
            },
            {
                "userId": "12345678"
            }]
        }
    }
}

I am using the Hadoop emulator for Azure HDInsight on Windows 8.1 with Java 1.8.0_73, and I compiled the SerDe successfully with Maven. I would expect the following to work:

add jar ../lib/json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar;

DROP TABLE events;
CREATE EXTERNAL TABLE events (
  context STRUCT<custom:STRUCT<dimensions:array<STRUCT<action:string>,STRUCT<userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';

When I take the nested STRUCTs out of the ARRAY<...>, the schema parses OK, but with them in, I get the following exception:

MismatchedTokenException(282!=9)
    at org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecognizer.java:617)
    at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HiveParser.java:34909)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(HiveParser.java:33113)
    at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java:36331)
    at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334)
    at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35054)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HiveParser.java:34914)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(HiveParser.java:33085)
    at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java:36331)
    at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334)
    at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35054)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.java:34754)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HiveParser.java:32951)
    at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:4544)
    at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2144)
    at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:409)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 2:69 missing > at ',' near 'STRUCT' in column specification
line 2:76 mismatched input '<' expecting : near 'STRUCT' in column specification


That external table looked good to me. Maybe try downloading another distribution of that JSON serde. I have had success with: http://www.congiu.net/hive-json-serde/

I have had success in HDInsight 3.2 with http://www.congiu.net/hive-json-serde/1.3/cdh5/ but you might try a newer build for HDP.

Documentation here: https://github.com/rcongiu/Hive-JSON-Serde

I ended up getting it to work by collapsing the nested STRUCTs inside the ARRAY into a single STRUCT, but I then have to access the values by index ([#]). For example, the following schema builds successfully:

DROP TABLE events;
CREATE EXTERNAL TABLE events (
  context STRUCT<custom:STRUCT<dimensions:ARRAY<STRUCT<action:string,userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';

Then I can access items such as the userId like so:

SELECT context.custom.dimensions.userId[1] FROM events;

It works, but is not as readable as I would prefer.
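If readability matters, another option is to keep the collapsed STRUCT but flatten the array with LATERAL VIEW explode, so each dimensions entry becomes its own row and no hard-coded index is needed. A minimal sketch against the events table above (standard Hive syntax, not tested on this cluster):

```sql
-- One row per entry in dimensions; d is a struct<action:string,userId:string>.
-- Fields absent from a given JSON object come back as NULL for that row.
SELECT d.action, d.userId
FROM events
LATERAL VIEW explode(context.custom.dimensions) dims AS d;
```

Filtering afterwards (e.g. WHERE d.userId IS NOT NULL) recovers the per-object values without positional lookups.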

I used an array of maps to process the App Insights custom dimensions with rcongiu's JSON SerDe. This is using an HDInsight cluster with a linked blob storage account.

add jar wasb://<mycontainer>@<mystorage>.blob.core.windows.net/json-serde-1.3.7-jar-with-dependencies.jar;
DROP TABLE Logs;
CREATE EXTERNAL TABLE Logs (
  event array<struct<
    count:int, 
    name:string>
  >,
  context struct<
    custom:struct<
        dimensions:array<map<string, string>>
        >
  >
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'wasb://<mycontainer>@<mystorage>.blob.core.windows.net/events/';

SELECT 
    event[0].name as EventName,
    context.custom.dimensions[0]['action'] as Action, 
    context.custom.dimensions[1]['key'] as Key
FROM Logs
WHERE event[0].name = 'Click';
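Since each dimensions entry is a heterogeneous key/value object, the map-based schema also pairs naturally with LATERAL VIEW explode, which avoids hard-coding positions like dimensions[0] and dimensions[1]. A sketch against the same Logs table (exploding a map yields key and value columns; not tested on this cluster):

```sql
-- First explode the array of maps into one map per row, then explode each
-- map into (k, v) pairs, giving one row per custom dimension entry.
SELECT event[0].name AS EventName, kv.k AS DimKey, kv.v AS DimValue
FROM Logs
LATERAL VIEW explode(context.custom.dimensions) dims AS dim
LATERAL VIEW explode(dim) kv AS k, v;
```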

