AWS Athena create table with nested json
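Hi everyone. I have a nested JSON object. I am trying to create a table from it that I will then query, and I am struggling to find where it is going wrong. I have tried what was suggested in this post and followed this tutorial, but I have not yet managed to create a table with actual readable data.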
[{
"player": "Charlie",
"club": {
"position": "Attacking Midfield",
"competitor": "Bardsley",
"offense": [{
"shots": 13,
"goals": 1,
"close_range": 3
"fouls_against": 2
}, {
"shots": 13,
"goals": 1,
"close_range": 3
"fouls_against": 2
}
],
"defense": [{
"tackle": 0,
"interception": 1,
"blocked_shots": 0
"fouls": 5
}, {
"tackle": 3,
"interception": 4,
"blocked_shots": 3
"fouls": 6
}
],
},
"training_schedule": [
{
"training_name": "Piggy in the middle",
"coach": "Grant Wool"
"training_start": "2008-03-02T14:00:00.000Z"
}, {
"training_name": "Weight training",
"coach": "John Smith"
"training_start": "2008-03-02T16:00:00.000Z"
}, {
"training_name": "Tactical Video Session",
"coach": "Eusebius Pontiff"
"training_start": "2008-03-02T18:00:00.000Z"
}, {
"training_name": "Cross Country Run",
"coach": "John Smith"
"training_start": "2008-03-04T12:00:00.000Z"
}, {
"training_name": "Offensive Possession Play",
"coach": "Grant Wool"
"training_start": "2008-03-04T16:00:00.000Z"
}, {
"training_name": "Attacking Set Pieces",
"coach": "Grant Wool"
"training_start": "2008-03-05T12:00:00.000Z"
}, {
"training_name": "Practice game (6 a side)",
"coach": "Grant Wool"
"training_start": "2008-03-05T14:00:00.000Z"
}
]
}]
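As you can see, it is a nested JSON with all sorts of goodies. I am trying to create a table from this data to find the best player of the weekend. The problem is that when I load the data and try to create the table, it fails without a clear message saying why. This is what I have tried on AWS Athena: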
CREATE EXTERNAL TABLE footie.players(
player array<struct<
player: string,
game_stats struct<
position: string,
competitor: string,
offense: array<struct<shots: int, goals: int, close_range: int, fouls_against: int>>,
defense: array<struct<tackle: int, interception: int, blocked_shots: int, fouls: int>>
>,
training_schedule: array<struct<
training_name: string,
coach: string
training_start: string>
>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'paths'='array')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://myprojects/footie.json'
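I keep getting "Service: Amazon; Status Code: 400; Error Code: InvalidRequestException". The crawler is just as bad: it gives me data with empty rows. I don't know whether I should try changing the file format as suggested in other posts, and if so, which format I should use.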
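The JSON record you posted in the question is missing some commas, and the whole record should appear on a single line for Athena to query the table correctly, like this: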
[{"player":"Charlie","club":{"position":"Attacking Midfield","competitor":"Bardsley","offense":[{"shots":13,"goals":1,"close_range":3,"fouls_against":2},{"shots":13,"goals":1,"close_range":3,"fouls_against":2}],"defense":[{"tackle":0,"interception":1,"blocked_shots":0,"fouls":5},{"tackle":3,"interception":4,"blocked_shots":3,"fouls":6}]},"training_schedule":[{"training_name":"Piggy in the middle","coach":"Grant Wool","training_start":"2008-03-02T14:00:00.000Z"},{"training_name":"Weight training","coach":"John Smith","training_start":"2008-03-02T16:00:00.000Z"},{"training_name":"Tactical Video Session","coach":"Eusebius Pontiff","training_start":"2008-03-02T18:00:00.000Z"},{"training_name":"Cross Country Run","coach":"John Smith","training_start":"2008-03-04T12:00:00.000Z"},{"training_name":"Offensive Possession Play","coach":"Grant Wool","training_start":"2008-03-04T16:00:00.000Z"},{"training_name":"Attacking Set Pieces","coach":"Grant Wool","training_start":"2008-03-05T12:00:00.000Z"},{"training_name":"Practice game (6 a side)","coach":"Grant Wool","training_start":"2008-03-05T14:00:00.000Z"}]}]
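One way to produce such a single-line record is to let a JSON parser do the work: it refuses input with missing commas and reports where the problem is, and once the input parses it can be written back without line breaks. The following is a minimal sketch using only Python's standard `json` module; the function name `to_single_line` and the sample data are illustrative and not part of the original post.

```python
import json


def to_single_line(text: str) -> str:
    """Parse JSON text and return it serialized on one line.

    Raises ValueError naming the line and column of the first syntax
    error, which is where a missing comma would be reported.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        ) from err
    # Compact separators drop the whitespace after ',' and ':'.
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical, shortened version of the record from the question.
    pretty = """[{
        "player": "Charlie",
        "club": {"position": "Attacking Midfield", "competitor": "Bardsley"}
    }]"""
    print(to_single_line(pretty))
```

Running this against the original multi-line file would raise the `ValueError` first, pointing at the line after each missing comma, so the file can be repaired before it is uploaded to S3.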
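Next, your DDL has a file name where there should only be a folder, i.e. instead of LOCATION 's3://myprojects/footie.json'
it should be LOCATION 's3://myprojects/'
You also need to make sure that only files belonging to this table/schema exist under this location.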
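To illustrate the difference between an object URI and the folder that LOCATION expects, here is a small sketch that derives the folder from the path of a file. The helper name `table_location` is hypothetical, and it assumes that a URI which already points at a folder ends with a slash.

```python
from urllib.parse import urlsplit


def table_location(s3_uri: str) -> str:
    """Return the folder (prefix) that contains the given S3 object.

    Assumption: a URI that already names a folder ends with '/',
    in which case it is returned unchanged.
    """
    parts = urlsplit(s3_uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # Drop everything after the last '/' (the file name, if any).
    folder = parts.path.rsplit("/", 1)[0] + "/"
    return f"s3://{parts.netloc}{folder}"


if __name__ == "__main__":
    print(table_location("s3://myprojects/footie.json"))
```

With the path from the question this yields the folder form 's3://myprojects/'. It does not check that the folder contains only files for this table; that still has to be verified in the bucket itself.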
Once I had made these changes and run the query below, I was able to preview the data.
CREATE EXTERNAL TABLE `test`(
`array` array<struct<player:string,club:struct<position:string,competitor:string,offense:array<struct<shots:int,goals:int,close_range:int,fouls_against:int>>,defense:array<struct<tackle:int,interception:int,blocked_shots:int,fouls:int>>>,training_schedule:array<struct<training_name:string,coach:string,training_start:string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='array')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://cvhgckgvk/'