
AWS Athena and handling JSON

I have millions of files with the following (poor) JSON format:

{
  "3000105002": [
    {
      "pool_id": "97808",
      "pool_name": "WILDCAT (DO NOT USE)",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    },
    {
      "pool_id": "96838",
      "pool_name": "DRY & ABANDONED",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    }
  ]
}

I've tried to write an Athena DDL that would accommodate this type of structure (especially the api field) with this:

CREATE EXTERNAL TABLE wp_info (
         api:array < struct < pool_id:string,
         pool_name:string,
         status:string,
         bhl:string,
         acreage:string>>)
LOCATION 's3://foo/'

When I try to create a table with this, the following error is thrown:

Your query has the following error(s):

FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type

What is a workable solution to this issue? Note that the api string is different for every one of the million files. The literal key name api does not actually appear in any of the files, so I hope there is a way for Athena to accommodate just the string-type value for these data.

If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format into something simpler, you can use regex functions to retrieve the relevant data that you need.
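For comparison, here is a rough sketch of what the "transform to something simpler" step could look like if you are able to preprocess the files before Athena reads them. It is written in Python and is only an illustration: the function names flatten_well_file and to_json_lines are made up for this example, and it assumes each file holds a single JSON object whose only key is the api string. The key is moved into an ordinary api column so that every record has the same shape.

```python
import json

# Sample input copied from the question: one top-level key (the api string)
# mapping to a list of pool records.
SAMPLE = """
{
  "3000105002": [
    {"pool_id": "97808", "pool_name": "WILDCAT (DO NOT USE)",
     "status": "Zone Permanently Plugged",
     "bhl": "D-12-10N-05E 902 FWL 902 FWL", "acreage": ""},
    {"pool_id": "96838", "pool_name": "DRY & ABANDONED",
     "status": "Zone Permanently Plugged",
     "bhl": "D-12-10N-05E 902 FWL 902 FWL", "acreage": ""}
  ]
}
"""


def flatten_well_file(text):
    """Return one flat dict per pool record, with the api key as a column.

    Hypothetical helper: assumes the file content is a JSON object of the
    form {"<api>": [ {...}, {...} ]}.
    """
    doc = json.loads(text)
    rows = []
    for api, pools in doc.items():
        for pool in pools:
            row = {"api": api}   # the varying key becomes a normal field
            row.update(pool)     # copy pool_id, pool_name, status, ...
            rows.append(row)
    return rows


def to_json_lines(rows):
    """Serialize rows as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    print(to_json_lines(flatten_well_file(SAMPLE)))
```

Each output line is then a flat record with the fields api, pool_id, pool_name, status, bhl and acreage, which is a far easier shape to describe in a table definition than a key that changes in every file.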

A simple way to do it is to use a Create-Table-As-Select (CTAS) query that converts the data from its complex JSON format to a simpler table format.

CREATE TABLE new_table
WITH (
      external_location = 's3://path/to/ctas_partitioned/',
      format = 'Parquet',
      parquet_compression = 'SNAPPY')
AS SELECT
 -- capture group 1 is the text between the quotes of each value
 regexp_extract(line, '"pool_id": "(\d+)"', 1) AS pool_id,
 -- [^"]* matches the whole value, including an empty one
 regexp_extract(line, '"pool_name": "([^"]*)"', 1) AS pool_name,
 ...
FROM json_lines_table;
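It can help to sanity-check the patterns outside Athena before running them over millions of files. The snippet below is a minimal sketch using Python's re module, which behaves the same as Athena's regex engine for patterns this simple (that equivalence is an assumption for anything more complex). The helper name extract is hypothetical. It also shows why a character class needs a quantifier: a bare [^"] matches exactly one character, so it cannot capture a multi-character name.

```python
import re

# Patterns equivalent to the ones passed to regexp_extract.
POOL_ID = re.compile(r'"pool_id": "(\d+)"')
POOL_NAME = re.compile(r'"pool_name": "([^"]*)"')

# Without a quantifier, the class matches a single character only.
ONE_CHAR_NAME = re.compile(r'"pool_name": "([^"])",')


def extract(pattern, line):
    """Return capture group 1, or None when the pattern does not match.

    Hypothetical helper that mirrors regexp_extract(line, pattern, 1),
    which returns NULL when there is no match.
    """
    match = pattern.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    name_line = '      "pool_name": "WILDCAT (DO NOT USE)",'
    id_line = '      "pool_id": "97808",'
    print(extract(POOL_ID, id_line))
    print(extract(POOL_NAME, name_line))
    print(extract(ONE_CHAR_NAME, name_line))
```

Because the sample files spread each record over several lines, a line-by-line table yields one field per row, so it is worth checking on a small sample that the rows come out the way you expect.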

Queries against the new table will also perform better, because it is stored in Parquet, a columnar format that lets Athena read only the columns a query needs.

Note that you can also update the table when you get new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme.
