
HiveSQL access JSON-array values

I have a Hive table that is generated by reading from sequence files in my HDFS. The sequence files contain JSON records that look like this:

{"Activity":"Started","CustomerName":"CustomerName3","DeviceID":"StationRoboter","OrderID":"CustomerOrderID3","DateTime":"2018-11-27T12:56:47Z+0100","Color":[{"Name":"red","Amount":1},{"Name":"green","Amount":1},{"Name":"blue","Amount":1}],"BrickTotalAmount":3}

Each record reports the colours of the product parts and how many of each were counted in one service process run.

Please note the JSON array in Color.

Therefore my code to create the table is:

CREATE EXTERNAL TABLE iotdata(
  activity              STRING,
  customername          STRING,
  deviceid              STRING,
  orderid               STRING,
  datetime              STRING,
  color                 ARRAY<MAP<String,String>>,
  bricktotalamount      STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';

This works, and if I do a select * on that table it looks like this:

[screenshot of the query result, not included]

My problem is that I have to access the data inside the color column for analysis. For example, I want to sum up all red amounts in the table.

This leads to several options and questions: how can I cast the amount, which is created as a string, to an integer?

How can I access the data in my color-column via select?

Or is there a possibility to change my table schema right at the beginning to get 4 extra columns for my 4 colours and 4 extra columns for the related colour amounts?

I also tried to read the whole JSON into a single string column and select the sub-content from there, but importing the JSON array into Hive that way gives me only NULL values, probably because my JSON file is not 100% well-formed.

The data inside your array is not a map as far as Hive is concerned; you need to specify its structure. I would recommend redefining your table and declaring the structure of the array's elements like this:

CREATE EXTERNAL TABLE iotdata(
  activity              STRING,
  customername          STRING,
  deviceid              STRING,
  orderid               STRING,
  datetime              STRING,
  color                 ARRAY<STRUCT<name:STRING,amount:BIGINT>>,
  bricktotalamount      STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';

That way you should be able to access the fields of the structure itself.
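As a rough sketch of what that access could look like, assuming the color column is declared as an array of structs with fields name and amount as in the table definition above: a single element can be addressed by index and field name, and explode can turn each array element into its own row. The aliases t and c are made up for this example. Because amount is declared as BIGINT, no cast should be needed before summing.

```sql
-- Read the first colour entry of each record (array index, then struct field).
SELECT orderid,
       color[0].name   AS first_color_name,
       color[0].amount AS first_color_amount
FROM iotdata;

-- One row per array element, then sum the amounts of all red entries.
-- "t" and "c" are illustrative aliases, not names from the original table.
SELECT SUM(c.amount) AS red_total
FROM iotdata
LATERAL VIEW explode(color) t AS c
WHERE c.name = 'red';
```

If the column stays declared as ARRAY<MAP<STRING,STRING>> instead, the value would come back as a string and would probably need something like CAST(color[0]['Amount'] AS INT) first.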

You can do this in two steps.

Create proper JSON table

CREATE external TABLE temp.test_json (
  activity string,
  bricktotalamount int,
  color array<struct<amount:int, name:string>>,
  customername string,
  datetime string,
  deviceid string,
  orderid string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table'


Explode the Table in Select Statement

select activity, bricktotalamount, customername, datetime, deviceid, orderid, name, amount from temp.test_json
lateral view inline(color) c as amount,name

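Building on the exploded rows, the totals per colour and the "one column per colour" layout asked about in the question could be computed roughly as follows. This is only a sketch against the temp.test_json table above; the output column names (total_amount, red_amount and so on) are invented for illustration, and the colour names are taken from the sample record.

```sql
-- Total amount per colour across the whole table.
SELECT name, SUM(amount) AS total_amount
FROM temp.test_json
LATERAL VIEW inline(color) c AS amount, name
GROUP BY name;

-- One row per order with a separate amount column per colour
-- (conditional aggregation; extend the CASE list for further colours).
SELECT orderid,
       SUM(CASE WHEN name = 'red'   THEN amount ELSE 0 END) AS red_amount,
       SUM(CASE WHEN name = 'green' THEN amount ELSE 0 END) AS green_amount,
       SUM(CASE WHEN name = 'blue'  THEN amount ELSE 0 END) AS blue_amount
FROM temp.test_json
LATERAL VIEW inline(color) c AS amount, name
GROUP BY orderid;
```

Note that a plain LATERAL VIEW drops records whose color array is empty or NULL; LATERAL VIEW OUTER keeps them if that matters for the analysis.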

