Is it possible to create a flat table from a nested JSON object in AWS Athena?
AWS Athena create table with nested json
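Hi everyone. I have a nested JSON object and I am trying to create a table from it that I can then query. I am struggling to find where it goes wrong. I have tried what is suggested in this post and followed this tutorial, but I have not yet managed to create a table with readable data.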
[{
"player": "Charlie",
"club": {
"position": "Attacking Midfield",
"competitor": "Bardsley",
"offense": [{
"shots": 13,
"goals": 1,
"close_range": 3
"fouls_against": 2
}, {
"shots": 13,
"goals": 1,
"close_range": 3
"fouls_against": 2
}
],
"defense": [{
"tackle": 0,
"interception": 1,
"blocked_shots": 0
"fouls": 5
}, {
"tackle": 3,
"interception": 4,
"blocked_shots": 3
"fouls": 6
}
],
},
"training_schedule": [
{
"training_name": "Piggy in the middle",
"coach": "Grant Wool"
"training_start": "2008-03-02T14:00:00.000Z"
}, {
"training_name": "Weight training",
"coach": "John Smith"
"training_start": "2008-03-02T16:00:00.000Z"
}, {
"training_name": "Tactical Video Session",
"coach": "Eusebius Pontiff"
"training_start": "2008-03-02T18:00:00.000Z"
}, {
"training_name": "Cross Country Run",
"coach": "John Smith"
"training_start": "2008-03-04T12:00:00.000Z"
}, {
"training_name": "Offensive Possession Play",
"coach": "Grant Wool"
"training_start": "2008-03-04T16:00:00.000Z"
}, {
"training_name": "Attacking Set Pieces",
"coach": "Grant Wool"
"training_start": "2008-03-05T12:00:00.000Z"
}, {
"training_name": "Practice game (6 a side)",
"coach": "Grant Wool"
"training_start": "2008-03-05T14:00:00.000Z"
}
]
}]
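As you can see, this is a nested JSON with all sorts of goodies. I am trying to build a table from this data to find the best player of the weekend. The problem is that when I load the data and try to create the table, it fails without a clear message explaining why. This is what I tried in AWS Athena: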
CREATE EXTERNAL TABLE footie.players(
player array<struct<
player: string,
game_stats struct<
position: string,
competitor: string,
offense: array<struct<shots: int, goals: int, close_range: int, fouls_against: int>>,
defense: array<struct<tackle: int, interception: int, blocked_shots: int, fouls: int>>
>,
training_schedule: array<struct<
training_name: string,
coach: string
training_start: string>
>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'paths'='array')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://myprojects/footie.json'
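I keep getting "Service: Amazon; Status Code: 400; Error Code: InvalidRequestException". The crawler does no better and gives me data with empty rows. I don't know whether I should change the file format as suggested in other posts, and if so, which format I should use.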
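The JSON record you posted in the question is missing some commas, and the whole record should be on a single line for Athena to query the table correctly, like this: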
[{"player":"Charlie","club":{"position":"Attacking Midfield","competitor":"Bardsley","offense":[{"shots":13,"goals":1,"close_range":3,"fouls_against":2},{"shots":13,"goals":1,"close_range":3,"fouls_against":2}],"defense":[{"tackle":0,"interception":1,"blocked_shots":0,"fouls":5},{"tackle":3,"interception":4,"blocked_shots":3,"fouls":6}]},"training_schedule":[{"training_name":"Piggy in the middle","coach":"Grant Wool","training_start":"2008-03-02T14:00:00.000Z"},{"training_name":"Weight training","coach":"John Smith","training_start":"2008-03-02T16:00:00.000Z"},{"training_name":"Tactical Video Session","coach":"Eusebius Pontiff","training_start":"2008-03-02T18:00:00.000Z"},{"training_name":"Cross Country Run","coach":"John Smith","training_start":"2008-03-04T12:00:00.000Z"},{"training_name":"Offensive Possession Play","coach":"Grant Wool","training_start":"2008-03-04T16:00:00.000Z"},{"training_name":"Attacking Set Pieces","coach":"Grant Wool","training_start":"2008-03-05T12:00:00.000Z"},{"training_name":"Practice game (6 a side)","coach":"Grant Wool","training_start":"2008-03-05T14:00:00.000Z"}]}]
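One way to check a file before uploading it is sketched below in Python. It parses the JSON, which fails loudly when a comma is missing, and writes it back out on a single line. The function name `to_single_line` and the small sample records are illustrative assumptions, not part of the original post.

```python
import json

def to_single_line(text: str) -> str:
    """Parse JSON text and return it serialized on one line.

    json.loads raises json.JSONDecodeError if the text is malformed,
    for example when a comma between two fields is missing.
    """
    data = json.loads(text)
    # separators=(",", ":") drops the spaces after commas and colons
    return json.dumps(data, separators=(",", ":"))

# Hypothetical sample: a small multi-line record shaped like the one in the question
pretty = """[{
  "player": "Charlie",
  "club": {"position": "Attacking Midfield", "competitor": "Bardsley"}
}]"""
compact = to_single_line(pretty)
print(compact)

# The same kind of record with a comma missing is rejected instead of loaded
broken = '{"shots": 13 "goals": 1}'
try:
    to_single_line(broken)
    rejected = False
except json.JSONDecodeError as err:
    rejected = True
    print("invalid JSON:", err)
```

Running the question's original file through a check like this would report where the first missing comma is, which is more informative than the 400 error from the DDL.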
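Next, your DDL has a file name where there should only be a folder. Instead of LOCATION 's3://myprojects/footie.json'
it should be LOCATION 's3://myprojects/'
and you need to make sure that only files belonging to this table/schema exist under that location, because Athena reads every file it finds there.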
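If the table location is built from a known file path, the folder part can be derived rather than typed by hand. This is a minimal sketch with a hypothetical helper, `table_location`; it assumes that any path not ending in "/" names a file, so a folder written without its trailing slash would be treated as a file.

```python
import posixpath
from urllib.parse import urlparse

def table_location(s3_uri: str) -> str:
    """Return the folder (prefix) part of an S3 URI, with a trailing slash."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    path = parsed.path
    # A path ending in "/" is already a folder; otherwise drop the last segment
    folder = path if path.endswith("/") else posixpath.dirname(path)
    if not folder.endswith("/"):
        folder += "/"
    return f"s3://{parsed.netloc}{folder}"

# The location from the question, reduced to the folder that holds the file
location = table_location("s3://myprojects/footie.json")
print(location)
```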
Once I made these changes and ran the query below, I was able to preview the data.
CREATE EXTERNAL TABLE `test`(
`array` array<struct<player:string,club:struct<position:string,competitor:string,offense:array<struct<shots:int,goals:int,close_range:int,fouls_against:int>>,defense:array<struct<tackle:int,interception:int,blocked_shots:int,fouls:int>>>,training_schedule:array<struct<training_name:string,coach:string,training_start:string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='array')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://cvhgckgvk/'
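The table above still holds the whole record in one nested column. To illustrate what a flat result would look like, here is a small Python sketch that produces one row per entry in `club.offense`. It only shows the target row shape; in Athena itself this kind of flattening is normally done in the query with UNNEST. The function name `flatten_offense` and the shortened sample line are assumptions made for illustration.

```python
import json

def flatten_offense(records):
    """Yield one flat dict per entry in each player's club.offense array."""
    for rec in records:
        club = rec["club"]
        for stats in club["offense"]:
            yield {
                "player": rec["player"],
                "position": club["position"],
                "competitor": club["competitor"],
                **stats,  # shots, goals, close_range, fouls_against
            }

# Shortened version of the single-line record from the answer
line = (
    '[{"player":"Charlie","club":{"position":"Attacking Midfield",'
    '"competitor":"Bardsley","offense":['
    '{"shots":13,"goals":1,"close_range":3,"fouls_against":2},'
    '{"shots":13,"goals":1,"close_range":3,"fouls_against":2}]}}]'
)
rows = list(flatten_offense(json.loads(line)))
for row in rows:
    print(row)
```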