
AWS Athena and handling JSON

I have millions of files with the following (poor) JSON format:

{
  "3000105002": [
    {
      "pool_id": "97808",
      "pool_name": "WILDCAT (DO NOT USE)",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    },
    {
      "pool_id": "96838",
      "pool_name": "DRY & ABANDONED",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    }
  ]
}

I've tried to write an Athena DDL that would accommodate this type of structure (especially the api field):

CREATE EXTERNAL TABLE wp_info (
         api:array < struct < pool_id:string,
         pool_name:string,
         status:string,
         bhl:string,
         acreage:string>>)
LOCATION 's3://foo/'

When I try to create a table with this DDL, the following error is thrown:

Your query has the following error(s):

FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type

What is a workable solution to this issue? Note that the api string is different in every one of the million files. The key name api does not actually appear inside any of the files; each file's top-level key is the API number itself, so I hope there is a way for Athena to accommodate just the string-type value for these data.

If you don't have control over the JSON format you are receiving, and you don't have a streaming service in the middle to transform the JSON into something simpler, you can use regex functions to retrieve the relevant data that you need.
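
The CTAS query below selects from a helper table, json_lines_table, that is not defined anywhere in this answer. Here is a minimal sketch of such a table, assuming the JSON text never contains tab characters (note also that the per-line regex extraction only pairs fields correctly when each JSON record sits on a single line; pretty-printed multi-line files like the sample above would need to be compacted to one document per line first):

-- One row per physical line of every file under the location;
-- the tab delimiter never fires, so the whole line lands in `line`.
CREATE EXTERNAL TABLE json_lines_table (
  line string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
LOCATION 's3://foo/';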

A simple way to do it is to use a Create Table As Select (CTAS) query that converts the data from its complex JSON format to a simpler table format.

CREATE TABLE new_table 
WITH (
      external_location = 's3://path/to/ctas_partitioned/', 
      format = 'Parquet',
      parquet_compression = 'SNAPPY')
AS SELECT 
 -- Capture each field's value with a capturing group; [^"]* matches
 -- the whole quoted value, not just a single character.
 regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
 regexp_extract(line, '"pool_name": "([^"]*)"', 1) as pool_name,
 regexp_extract(line, '"status": "([^"]*)"', 1) as status,
 regexp_extract(line, '"bhl": "([^"]*)"', 1) as bhl,
 regexp_extract(line, '"acreage": "([^"]*)"', 1) as acreage
FROM json_lines_table;

Queries against the new table will also perform better, since the data is now stored in the columnar Parquet format.
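
For example, a query that references only a couple of columns will scan just those columns from the Parquet files:

SELECT pool_id, status
FROM new_table
WHERE status = 'Zone Permanently Plugged';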

Note that you can also update the table when you get new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme, as sketched below.
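
A sketch of that pattern, assuming a hypothetical json_lines_table_new table defined over only the newly arrived files (the table name and part= prefix here are illustrative):

-- Follow-up load: write the new batch under its own empty prefix
-- (Athena CTAS requires the target location to be empty).
CREATE TABLE new_table_batch_02
WITH (
      external_location = 's3://path/to/ctas_partitioned/part=02',
      format = 'Parquet',
      parquet_compression = 'SNAPPY')
AS SELECT
 regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
 regexp_extract(line, '"pool_name": "([^"]*)"', 1) as pool_name,
 regexp_extract(line, '"status": "([^"]*)"', 1) as status,
 regexp_extract(line, '"bhl": "([^"]*)"', 1) as bhl,
 regexp_extract(line, '"acreage": "([^"]*)"', 1) as acreage
FROM json_lines_table_new;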
