I have a table in Hive that is generated by reading from sequence files in HDFS. The sequence files contain JSON records that look like this:
{"Activity":"Started","CustomerName":"CustomerName3","DeviceID":"StationRoboter","OrderID":"CustomerOrderID3","DateTime":"2018-11-27T12:56:47Z+0100","Color":[{"Name":"red","Amount":1},{"Name":"green","Amount":1},{"Name":"blue","Amount":1}],"BrickTotalAmount":3}
Each record reports the colours of the product parts and the amount of each colour counted in one service process run.
Please note the JSON array in Color.
Therefore my code to create the table is:
CREATE EXTERNAL TABLE iotdata(
activity STRING,
customername STRING,
deviceid STRING,
orderid STRING,
datetime STRING,
color ARRAY<MAP<String,String>>,
bricktotalamount STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';
This works, and if I do a select * on that table it looks like this:
My problem is that I have to access the data inside the color column for analysis. For example, I want to sum up all red amounts in the table.
This leads to several options and questions: how can I cast the amount, which is created as a string, to an integer?
How can I access the data in my color column via SELECT?
Or is there a possibility to change my table schema right at the beginning to get 4 extra columns for my 4 colours and 4 extra columns for the related colour amounts?
I also tried to read the whole JSON as a string into one column and select the sub-content from there, but importing the JSON array into Hive this way only gives me NULL values, probably because my JSON file is not 100% well-formed.
The data inside your array is not a map as far as Hive is concerned. Each element is a struct, and you need to declare it as such. I would recommend redefining your table and specifying the structure of the array's elements like this:
CREATE EXTERNAL TABLE iotdata(
activity STRING,
customername STRING,
deviceid STRING,
orderid STRING,
datetime STRING,
color ARRAY<STRUCT<NAME: STRING,AMOUNT:BIGINT>>,
bricktotalamount STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';
That way you should be able to access the fields of the structure itself.
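As a rough sketch of what that access could look like, assuming the iotdata table has been recreated with color declared as an array of structs as above (the aliases t, c, first_colour_name, first_colour_amount, brick_total and red_total are only illustrative names):

```sql
-- Read a single array element by index, then a field of the struct.
-- bricktotalamount is still a STRING in this schema, so it is cast explicitly.
SELECT orderid,
       color[0].name                 AS first_colour_name,
       color[0].amount               AS first_colour_amount,
       CAST(bricktotalamount AS INT) AS brick_total
FROM iotdata;

-- Sum all red amounts: explode the array into one row per struct,
-- then filter on the struct's name field.
SELECT SUM(c.amount) AS red_total
FROM iotdata
LATERAL VIEW explode(color) t AS c
WHERE c.name = 'red';
```

Because amount is declared as BIGINT in the struct, no cast is needed for the sum. If it were left as a string, CAST(c.amount AS INT) would be the equivalent step.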
You can do this in two steps.
Create proper JSON table
CREATE external TABLE temp.test_json (
activity string,
bricktotalamount int,
color array<struct<amount:int, name:string>>,
customername string,
datetime string,
deviceid string,
orderid string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table';
Explode the color array in the SELECT statement
select activity, bricktotalamount, customername, datetime, deviceid, orderid, name, amount from temp.test_json
lateral view inline(color) c as amount,name;
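Building on the exploded rows, one possible way to get a separate amount column per colour is conditional aggregation. This is only a sketch: it assumes the colour names are exactly red, green and blue as in the sample record, that orderid is a suitable grouping key, and the column aliases are made up for illustration.

```sql
-- One output row per order, with one amount column per colour.
-- inline() turns each struct in the array into a row with columns amount and name.
SELECT orderid,
       SUM(CASE WHEN name = 'red'   THEN amount ELSE 0 END) AS red_amount,
       SUM(CASE WHEN name = 'green' THEN amount ELSE 0 END) AS green_amount,
       SUM(CASE WHEN name = 'blue'  THEN amount ELSE 0 END) AS blue_amount
FROM temp.test_json
LATERAL VIEW inline(color) c AS amount, name
GROUP BY orderid;
```

Dropping the GROUP BY and the orderid column would instead give the totals per colour over the whole table.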