HiveSQL access JSON-array values

I have a table in Hive that is generated by reading from a sequence file in my HDFS. The sequence files contain JSON records that look like this:

{"Activity":"Started","CustomerName":"CustomerName3","DeviceID":"StationRoboter","OrderID":"CustomerOrderID3","DateTime":"2018-11-27T12:56:47Z+0100","Color":[{"Name":"red","Amount":1},{"Name":"green","Amount":1},{"Name":"blue","Amount":1}],"BrickTotalAmount":3}

Each record reports the colours of the product parts and how many of each were counted in one service process run.

Please note the JSON array in the Color field.

Therefore my code to create the table is:

CREATE EXTERNAL TABLE iotdata(
  activity              STRING,
  customername          STRING,
  deviceid              STRING,
  orderid               STRING,
  datetime              STRING,
  color                 ARRAY<MAP<String,String>>,
  bricktotalamount      STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';

This works, and if I do a select * on that table it looks like this:

[Image: result of select * on the iotdata table]

But my problem is that I have to access the data inside the color column for analysis. For example, I want to calculate all red values in the table.

So this leads to several options and questions: how can I cast the amount string that is created to an integer?

How can I access the data in my color column via select?

Or is there a possibility to change my table schema right at the beginning to get 4 extra columns for my 4 colours and 4 extra columns for the related colour amounts?

I also tried to read the whole JSON as a string into one column and select the subcontent there, but importing the JSON array into Hive this way gives me only NULL values, probably because my JSON file is not 100% well-formed.

The data inside your array is definitely not a map for Hive; you need to specify its structure. I would recommend redefining your table, specifying the structure of the array's data like this:

CREATE EXTERNAL TABLE iotdata(
  activity              STRING,
  customername          STRING,
  deviceid              STRING,
  orderid               STRING,
  datetime              STRING,
  color                 ARRAY<STRUCT<NAME:STRING,AMOUNT:BIGINT>>,
  bricktotalamount      STRING
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/IoTData/scray-data-000-v0';

That way you should be able to access the fields of the structure itself.
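
As a rough sketch of what that access could look like, assuming the iotdata table has been re-created with the ARRAY<STRUCT<...>> definition above: because amount is declared as BIGINT, it can be summed directly without a cast. The aliases exploded, c, and red_total are illustrative names only, not part of the original answer.

```sql
-- Assumes iotdata was re-created with
-- color ARRAY<STRUCT<name:STRING, amount:BIGINT>> as suggested above.

-- Access a single array element and its struct fields by index.
SELECT orderid,
       color[0].name   AS first_color_name,
       color[0].amount AS first_color_amount
FROM iotdata;

-- Turn every array element into its own row, then sum the red amounts.
SELECT SUM(c.amount) AS red_total
FROM iotdata
LATERAL VIEW explode(color) exploded AS c
WHERE c.name = 'red';
```

If the amount were kept as a STRING instead, wrapping it as CAST(c.amount AS INT) inside the SUM should give the same result.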

You can do this in two steps.

Create a proper JSON table

CREATE external TABLE temp.test_json (
  activity string,
  bricktotalamount int,
  color array<struct<amount:int, name:string>>,
  customername string,
  datetime string,
  deviceid string,
  orderid string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table'

Explode the table in the select statement

select activity, bricktotalamount, customername, datetime, deviceid, orderid, name, amount from temp.test_json
lateral view inline(color) c as amount,name
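
Building on the exploded rows, one possible way to get a separate amount column per colour (as asked in the question) is conditional aggregation. This is only a sketch: it assumes the temp.test_json table defined above, that orderid identifies one record, and that the colour names are the three that appear in the sample JSON. The output column names are made up for illustration.

```sql
-- Assumes temp.test_json as defined above, with
-- color array<struct<amount:int, name:string>>.
-- inline() emits one row per array element, with columns in struct field order.
SELECT orderid,
       SUM(CASE WHEN name = 'red'   THEN amount ELSE 0 END) AS red_amount,
       SUM(CASE WHEN name = 'green' THEN amount ELSE 0 END) AS green_amount,
       SUM(CASE WHEN name = 'blue'  THEN amount ELSE 0 END) AS blue_amount
FROM temp.test_json
LATERAL VIEW inline(color) c AS amount, name
GROUP BY orderid;
```

Further colours would need an additional CASE expression each, since the set of output columns has to be fixed when the query is written.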

