AWS Athena: creating a table from nested JSON

Question

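Hi everyone. I have a nested JSON object. I am trying to create a table from it that I will then query, and I am struggling to find where it might be going wrong. I have tried the suggestions in this post and followed this tutorial, but I have not yet managed to create a table with actual readable data.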

[{              
    "player": "Charlie",            
    "club": {           
        "position": "Attacking Midfield",       
        "competitor": "Bardsley",       
        "offense": [{       
                "shots": 13,
                "goals": 1,
                "close_range": 3
                "fouls_against": 2
            }, {    
                "shots": 13,
                "goals": 1,
                "close_range": 3
                "fouls_against": 2
            }   
        ],      
        "defense": [{       
                "tackle": 0,
                "interception": 1,
                "blocked_shots": 0
                "fouls": 5
            }, {    
                "tackle": 3,
                "interception": 4,
                "blocked_shots": 3
                "fouls": 6
            }   
        ],      
    },          
    "training_schedule": [
        {           
            "training_name": "Piggy in the middle",
            "coach": "Grant Wool"
            "training_start": "2008-03-02T14:00:00.000Z"
        }, {    
            "training_name": "Weight training",
            "coach": "John Smith"
            "training_start": "2008-03-02T16:00:00.000Z"
        }, {    
            "training_name": "Tactical Video Session",
            "coach": "Eusebius Pontiff"
            "training_start": "2008-03-02T18:00:00.000Z"
        }, {    
            "training_name": "Cross Country Run",
            "coach": "John Smith"
            "training_start": "2008-03-04T12:00:00.000Z"
        }, {    
            "training_name": "Offensive Possession Play",
            "coach": "Grant Wool"
            "training_start": "2008-03-04T16:00:00.000Z"
        }, {    
            "training_name": "Attacking Set Pieces",
            "coach": "Grant Wool"
            "training_start": "2008-03-05T12:00:00.000Z"
        }, {    
            "training_name": "Practice game (6 a side)",
            "coach": "Grant Wool"
            "training_start": "2008-03-05T14:00:00.000Z"
        }   
    ]   
}]

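As you can see, this is a nested JSON with all sorts of goodies. I am trying to use this data to create a table for finding the best player of the weekend. The problem is that when I load this data and try to create the table, it fails without a very clear message explaining why. This is what I have tried on AWS Athena: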

CREATE EXTERNAL TABLE footie.players( 
 player array<struct< 
  player: string,
  game_stats struct<
        position: string,
                competitor: string,
                offense: array<struct<shots: int, goals: int, close_range: int, fouls_against: int>>,
                defense: array<struct<tackle: int, interception: int, blocked_shots: int, fouls: int>>
                   >,
  training_schedule: array<struct<
        training_name: string,
        coach: string
        training_start: string>
 >>   
)           
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'paths'='array') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://myprojects/footie.json'

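I keep getting "Service: Amazon; Status Code: 400; Error Code: InvalidRequestException". The crawler is just as bad: it gives me data with empty rows. I don't know whether I should try changing the file format as suggested in the other posts and, if so, what the correct format would be.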

Answer 1

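The JSON record posted in your question is missing some commas, and the whole record should appear on a single line for Athena to query the table correctly, as shown below: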

[{"player":"Charlie","club":{"position":"Attacking Midfield","competitor":"Bardsley","offense":[{"shots":13,"goals":1,"close_range":3,"fouls_against":2},{"shots":13,"goals":1,"close_range":3,"fouls_against":2}],"defense":[{"tackle":0,"interception":1,"blocked_shots":0,"fouls":5},{"tackle":3,"interception":4,"blocked_shots":3,"fouls":6}]},"training_schedule":[{"training_name":"Piggy in the middle","coach":"Grant Wool","training_start":"2008-03-02T14:00:00.000Z"},{"training_name":"Weight training","coach":"John Smith","training_start":"2008-03-02T16:00:00.000Z"},{"training_name":"Tactical Video Session","coach":"Eusebius Pontiff","training_start":"2008-03-02T18:00:00.000Z"},{"training_name":"Cross Country Run","coach":"John Smith","training_start":"2008-03-04T12:00:00.000Z"},{"training_name":"Offensive Possession Play","coach":"Grant Wool","training_start":"2008-03-04T16:00:00.000Z"},{"training_name":"Attacking Set Pieces","coach":"Grant Wool","training_start":"2008-03-05T12:00:00.000Z"},{"training_name":"Practice game (6 a side)","coach":"Grant Wool","training_start":"2008-03-05T14:00:00.000Z"}]}]
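One way to produce a single-line record like the one above is to parse the pretty-printed file and re-serialize it compactly. The following is a minimal Python sketch, assuming the file is small enough to load into memory; `to_single_line` is a hypothetical helper name, not part of any AWS tooling. Parsing first also surfaces syntax problems such as a missing comma, because `json.loads` raises `json.JSONDecodeError` with the line and column of the error.

```python
import json


def to_single_line(text: str) -> str:
    """Parse JSON text and return it re-serialized on a single line.

    Raises json.JSONDecodeError if the input is not valid JSON
    (for example, when a comma between two fields is missing).
    """
    data = json.loads(text)
    # Compact separators drop the spaces after ',' and ':'; json.dumps
    # never emits newlines unless `indent` is given.
    return json.dumps(data, separators=(",", ":"))


if __name__ == "__main__":
    pretty = """
    [{
        "player": "Charlie",
        "club": {"position": "Attacking Midfield", "competitor": "Bardsley"}
    }]
    """
    print(to_single_line(pretty))

    # Same defect as in the question: no comma between the two fields.
    broken = '{"coach": "Grant Wool" "training_start": "2008-03-02T14:00:00.000Z"}'
    try:
        to_single_line(broken)
    except json.JSONDecodeError as err:
        print(f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
```

Running the check before uploading the file to S3 catches malformed records locally, where the error message points at the exact position.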

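Next, your DDL has a file name where there should only be a folder: instead of LOCATION 's3://myprojects/footie.json' it should be LOCATION 's3://myprojects/', and you need to make sure that only files related to this table/schema exist under that location.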
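The file-versus-folder distinction can be illustrated with a small Python sketch that derives the folder location from a file URI. `folder_location` is a hypothetical helper, and it assumes that whatever follows the last '/' is a file name, so a URI that already names a folder should end with '/'.

```python
from urllib.parse import urlparse


def folder_location(s3_uri: str) -> str:
    """Return the folder (prefix) part of an S3 URI, with a trailing slash.

    Assumption: anything after the last '/' is treated as a file name, so a
    folder URI should already end with '/'.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # rpartition splits at the last '/', so `head` is the path up to
    # (but not including) the file name.
    head, _, _ = parsed.path.rpartition("/")
    return f"s3://{parsed.netloc}{head}/"


if __name__ == "__main__":
    print(folder_location("s3://myprojects/footie.json"))
```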

Once I made these changes and ran the query below, I was able to preview the data.

CREATE EXTERNAL TABLE `test`(
  `array` array<struct<player:string,club:struct<position:string,competitor:string,offense:array<struct<shots:int,goals:int,close_range:int,fouls_against:int>>,defense:array<struct<tackle:int,interception:int,blocked_shots:int,fouls:int>>>,training_schedule:array<struct<training_name:string,coach:string,training_start:string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='array') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://cvhgckgvk/'
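The long column type in the DDL above mirrors the shape of the JSON record: objects become struct<...>, arrays become array<...>, and scalars become int or string. As a rough sketch of that mapping, the Python below derives such a type string from a parsed sample record. `hive_type` is a hypothetical helper, and it assumes that every array is non-empty with uniformly shaped elements, that integers fit in `int`, and that field names need no quoting.

```python
import json


def hive_type(value) -> str:
    """Map a parsed JSON value to a Hive/Athena column type string."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        if not value:
            raise ValueError("cannot infer the element type of an empty array")
        # Assumption: all elements share the shape of the first one.
        return f"array<{hive_type(value[0])}>"
    if isinstance(value, dict):
        if not value:
            raise ValueError("cannot infer the fields of an empty object")
        # dicts keep insertion order, so fields appear as in the JSON text.
        fields = ",".join(f"{key}:{hive_type(val)}" for key, val in value.items())
        return f"struct<{fields}>"
    raise ValueError(f"unsupported JSON value: {value!r}")


if __name__ == "__main__":
    sample = json.loads(
        '[{"player":"Charlie","club":{"position":"Attacking Midfield",'
        '"offense":[{"shots":13,"goals":1}]}}]'
    )
    print(hive_type(sample))
```

Comparing the generated string with a hand-written type is a quick way to spot a misnamed field or an unbalanced angle bracket before submitting the DDL.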
