简体   繁体   中英

AWS Athena create table with nested json

everyone. I have a nested json object. I am trying to create a table which I will then query. I am struggling to see where I could be going wrong. I have tried as was sugested in this post and followed this tutorial and have yet to create a table with actual readable data.

[{              
    "player": "Charlie",            
    "club": {           
        "position": "Attacking Midfield",       
        "competitor": "Bardsley",       
        "offense": [{       
                "shots": 13,
                "goals": 1,
                "close_range": 3
                "fouls_against": 2
            }, {    
                "shots": 13,
                "goals": 1,
                "close_range": 3
                "fouls_against": 2
            }   
        ],      
        "defense": [{       
                "tackle": 0,
                "interception": 1,
                "blocked_shots": 0
                "fouls": 5
            }, {    
                "tackle": 3,
                "interception": 4,
                "blocked_shots": 3
                "fouls": 6
            }   
        ],      
    },          
    "training_schedule": [
        {           
            "training_name": "Piggy in the middle",
            "coach": "Grant Wool"
            "training_start": "2008-03-02T14:00:00.000Z"
        }, {    
            "training_name": "Weight training",
            "coach": "John Smith"
            "training_start": "2008-03-02T16:00:00.000Z"
        }, {    
            "training_name": "Tactical Video Session",
            "coach": "Eusebius Pontiff"
            "training_start": "2008-03-02T18:00:00.000Z"
        }, {    
            "training_name": "Cross Country Run",
            "coach": "John Smith"
            "training_start": "2008-03-04T12:00:00.000Z"
        }, {    
            "training_name": "Offensive Possession Play",
            "coach": "Grant Wool"
            "training_start": "2008-03-04T16:00:00.000Z"
        }, {    
            "training_name": "Attacking Set Pieces",
            "coach": "Grant Wool"
            "training_start": "2008-03-05T12:00:00.000Z"
        }, {    
            "training_name": "Practice game (6 a side)",
            "coach": "Grant Wool"
            "training_start": "2008-03-05T14:00:00.000Z"
        }   
    ]   
}]

As you can see this is a nested json with all kinds of goodness. I am trying to create a table using this data to find the best players for the weekend. The problem I have is that when I load this data and attempt to create the table it fails with a none too clear message as to why. Here is what I have tried on AWS Athena:

CREATE EXTERNAL TABLE footie.players( 
 player array<struct< 
  player: string,
  game_stats struct<
        position: string,
                competitor: string,
                offense: array<struct<shots: int, goals: int, close_range: int, fouls_against: int>>,
                defense: array<struct<tackle: int, interception: int, blocked_shots: int, fouls: int>>
                   >,
  training_schedule: array<struct<
        training_name: string,
        coach: string
        training_start: string>
 >>   
)           
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'paths'='array') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://myprojects/footie.json'

I keep getting "service: amazonathena; status code: 400; error code: invalidrequestexception". The crawler is just as bad giving me empty rows of data. I'm at a loss whether I should try changing the file format as has been suggested in other posts and if so what is the correct format I should be going for?

The JSON record that you have posted in your question has some missing commas and the whole record should be present on single line for Athena to properly query the table as shown below:

[{"player":"Charlie","club":{"position":"Attacking Midfield","competitor":"Bardsley","offense":[{"shots":13,"goals":1,"close_range":3,"fouls_against":2},{"shots":13,"goals":1,"close_range":3,"fouls_against":2}],"defense":[{"tackle":0,"interception":1,"blocked_shots":0,"fouls":5},{"tackle":3,"interception":4,"blocked_shots":3,"fouls":6}]},"training_schedule":[{"training_name":"Piggy in the middle","coach":"Grant Wool","training_start":"2008-03-02T14:00:00.000Z"},{"training_name":"Weight training","coach":"John Smith","training_start":"2008-03-02T16:00:00.000Z"},{"training_name":"Tactical Video Session","coach":"Eusebius Pontiff","training_start":"2008-03-02T18:00:00.000Z"},{"training_name":"Cross Country Run","coach":"John Smith","training_start":"2008-03-04T12:00:00.000Z"},{"training_name":"Offensive Possession Play","coach":"Grant Wool","training_start":"2008-03-04T16:00:00.000Z"},{"training_name":"Attacking Set Pieces","coach":"Grant Wool","training_start":"2008-03-05T12:00:00.000Z"},{"training_name":"Practice game (6 a side)","coach":"Grant Wool","training_start":"2008-03-05T14:00:00.000Z"}]}]

Now your DDL has file name present in location which should be only the folder ie instead of LOCATION 's3://myprojects/footie.json' it should be LOCATION 's3://myprojects/' and you need to make sure only files related to this table/schema present under this location.

Once I did these changes and ran below query I was able to preview the data .

CREATE EXTERNAL TABLE `test`(
  `array` array<struct<player:string,club:struct<position:string,competitor:string,offense:array<struct<shots:int,goals:int,close_range:int,fouls_against:int>>,defense:array<struct<tackle:int,interception:int,blocked_shots:int,fouls:int>>>,training_schedule:array<struct<training_name:string,coach:string,training_start:string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='array') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://cvhgckgvk/'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM