
AWS Athena create table with nested json

Hi everyone. I have a nested JSON object, and I am trying to create a table from it that I can then query. I am struggling to see where I am going wrong. I have tried what was suggested in this post and followed this tutorial, but have yet to create a table with actual readable data.

[{              
    "player": "Charlie",            
    "club": {           
        "position": "Attacking Midfield",       
        "competitor": "Bardsley",       
        "offense": [{       
                "shots": 13,
                "goals": 1,
                "close_range": 3
                "fouls_against": 2
            }, {    
                "shots": 13,
                "goals": 1,
                "close_range": 3
                "fouls_against": 2
            }   
        ],      
        "defense": [{       
                "tackle": 0,
                "interception": 1,
                "blocked_shots": 0
                "fouls": 5
            }, {    
                "tackle": 3,
                "interception": 4,
                "blocked_shots": 3
                "fouls": 6
            }   
        ],      
    },          
    "training_schedule": [
        {           
            "training_name": "Piggy in the middle",
            "coach": "Grant Wool"
            "training_start": "2008-03-02T14:00:00.000Z"
        }, {    
            "training_name": "Weight training",
            "coach": "John Smith"
            "training_start": "2008-03-02T16:00:00.000Z"
        }, {    
            "training_name": "Tactical Video Session",
            "coach": "Eusebius Pontiff"
            "training_start": "2008-03-02T18:00:00.000Z"
        }, {    
            "training_name": "Cross Country Run",
            "coach": "John Smith"
            "training_start": "2008-03-04T12:00:00.000Z"
        }, {    
            "training_name": "Offensive Possession Play",
            "coach": "Grant Wool"
            "training_start": "2008-03-04T16:00:00.000Z"
        }, {    
            "training_name": "Attacking Set Pieces",
            "coach": "Grant Wool"
            "training_start": "2008-03-05T12:00:00.000Z"
        }, {    
            "training_name": "Practice game (6 a side)",
            "coach": "Grant Wool"
            "training_start": "2008-03-05T14:00:00.000Z"
        }   
    ]   
}]

As you can see, this is a nested JSON with all kinds of goodness. I am trying to create a table from this data to find the best players for the weekend. The problem is that when I load this data and attempt to create the table, it fails with a none-too-clear message as to why. Here is what I have tried on AWS Athena:

CREATE EXTERNAL TABLE footie.players( 
 player array<struct< 
  player: string,
  game_stats struct<
        position: string,
                competitor: string,
                offense: array<struct<shots: int, goals: int, close_range: int, fouls_against: int>>,
                defense: array<struct<tackle: int, interception: int, blocked_shots: int, fouls: int>>
                   >,
  training_schedule: array<struct<
        training_name: string,
        coach: string
        training_start: string>
 >>   
)           
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'paths'='array') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://myprojects/footie.json'

I keep getting "service: amazonathena; status code: 400; error code: invalidrequestexception". The crawler is just as bad, giving me empty rows of data. I'm at a loss as to whether I should try changing the file format, as has been suggested in other posts, and if so, which format I should be going for.

The JSON record posted in your question is missing some commas, and the whole record should be on a single line for Athena to query the table properly, as shown below:

[{"player":"Charlie","club":{"position":"Attacking Midfield","competitor":"Bardsley","offense":[{"shots":13,"goals":1,"close_range":3,"fouls_against":2},{"shots":13,"goals":1,"close_range":3,"fouls_against":2}],"defense":[{"tackle":0,"interception":1,"blocked_shots":0,"fouls":5},{"tackle":3,"interception":4,"blocked_shots":3,"fouls":6}]},"training_schedule":[{"training_name":"Piggy in the middle","coach":"Grant Wool","training_start":"2008-03-02T14:00:00.000Z"},{"training_name":"Weight training","coach":"John Smith","training_start":"2008-03-02T16:00:00.000Z"},{"training_name":"Tactical Video Session","coach":"Eusebius Pontiff","training_start":"2008-03-02T18:00:00.000Z"},{"training_name":"Cross Country Run","coach":"John Smith","training_start":"2008-03-04T12:00:00.000Z"},{"training_name":"Offensive Possession Play","coach":"Grant Wool","training_start":"2008-03-04T16:00:00.000Z"},{"training_name":"Attacking Set Pieces","coach":"Grant Wool","training_start":"2008-03-05T12:00:00.000Z"},{"training_name":"Practice game (6 a side)","coach":"Grant Wool","training_start":"2008-03-05T14:00:00.000Z"}]}]
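If you want to check a file for this kind of problem before uploading it, here is a minimal Python sketch using only the standard library. The helper name to_single_line and the small sample strings are made up for illustration; the idea is simply that json.loads rejects a record with a missing comma and reports where, and json.dumps with compact separators writes a valid record on one line.

```python
import json


def to_single_line(text: str) -> str:
    """Parse JSON text and return it compacted onto a single line.

    Raises json.JSONDecodeError (carrying line and column) if the text
    is not valid JSON, e.g. when a comma is missing between two keys.
    """
    data = json.loads(text)
    # Compact separators drop the spaces after "," and ":".
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


# Hypothetical sample: same mistake as in the question (no comma after 3).
broken = '{"shots": 13,\n "close_range": 3\n "fouls_against": 2}'
try:
    to_single_line(broken)
except json.JSONDecodeError as err:
    print(f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")

# With the comma restored, the record parses and is written on one line.
fixed = '{"shots": 13,\n "close_range": 3,\n "fouls_against": 2}'
print(to_single_line(fixed))
```

The same function can be applied to the contents of the whole file: read it, pass the text through to_single_line, and write the result back before uploading to S3.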

Next, the LOCATION in your DDL contains a file name, but it should point only to the folder: instead of LOCATION 's3://myprojects/footie.json' it should be LOCATION 's3://myprojects/'. You also need to make sure that only files related to this table/schema are present under this location.
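If the DDL is generated by a script, the folder can be derived from the file's URI. This is only a sketch: table_location is a hypothetical helper, not part of any AWS library, and it assumes the last path segment is a file name unless the URI already ends with a slash.

```python
import posixpath
from urllib.parse import urlparse


def table_location(s3_uri: str) -> str:
    """Return the folder part of an S3 URI, with a trailing slash.

    Assumption: the last path segment is a file name unless the URI
    already ends with "/".
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    key = parsed.path.lstrip("/")
    # dirname drops the final segment ("footie.json") and keeps the prefix.
    prefix = posixpath.dirname(key)
    return f"s3://{parsed.netloc}/{prefix + '/' if prefix else ''}"


print(table_location("s3://myprojects/footie.json"))  # s3://myprojects/
```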

Once I made these changes and ran the query below, I was able to preview the data.

CREATE EXTERNAL TABLE `test`(
  `array` array<struct<player:string,club:struct<position:string,competitor:string,offense:array<struct<shots:int,goals:int,close_range:int,fouls_against:int>>,defense:array<struct<tackle:int,interception:int,blocked_shots:int,fouls:int>>>,training_schedule:array<struct<training_name:string,coach:string,training_start:string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='array') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://cvhgckgvk/'
