AWS Athena and handling JSON

Question

I have millions of files with the following (poor) JSON format:

{
  "3000105002":[
    {
      "pool_id": "97808",
      "pool_name": "WILDCAT (DO NOT USE)",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    },
    {
      "pool_id": "96838",
      "pool_name": "DRY & ABANDONED",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    }
  ]
}

I have tried to write an Athena DDL to accommodate this type of structure (especially the api field):

CREATE EXTERNAL TABLE wp_info (
         api:array < struct < pool_id:string,
         pool_name:string,
         status:string,
         bhl:string,
         acreage:string>>)
LOCATION 's3://foo/'

When I try to create the table with this DDL, the following error is thrown:

Your query has the following error(s):

FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type

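What is a workable solution to this problem? Note that the api string is different for every one of the millions of files. The api key itself does not actually appear in any of the files, so I am hoping Athena has a way to accommodate just the string-type value for this data.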

Answer 1

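If you have no control over the JSON format you receive, and there is no streaming service in between to transform it into something simpler, you can use regex functions to retrieve the relevant data you need.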

A simple way to do this is a Create-Table-As-Select (CTAS) query, which converts the data from its complex JSON format into a simpler table format.

CREATE TABLE new_table 
WITH (
      external_location = 's3://path/to/ctas_partitioned/', 
      format = 'Parquet',
      parquet_compression = 'SNAPPY')
AS SELECT 
 regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
 regexp_extract(line, '"pool_name": "([^"]*)"', 1) as pool_name,
 ...
FROM json_lines_table;

Because the new table is stored as Parquet, queries against it will also perform better.

Note that you can also update the table by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partitioning scheme.
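
The CTAS query above reads from json_lines_table, which the answer does not define. A minimal sketch of such a table follows, assuming the raw files live under s3://foo/ (the location from the question) and that each record sits on a single physical line. The single string column is meant to expose each line of text as one row. With pretty-printed files like the sample in the question, each row would hold only a fragment of a record, so pool_id and pool_name would be extracted on different rows.

```sql
-- Hypothetical raw-text table: one row per line of each file under LOCATION.
-- No ROW FORMAT is given, so Athena should fall back to its default text
-- SerDe (LazySimpleSerDe). Assumption: the default field delimiter (the
-- control character \001) never occurs in the data, so the whole line
-- lands in the single column.
CREATE EXTERNAL TABLE json_lines_table (
  line string
)
STORED AS TEXTFILE
LOCATION 's3://foo/';
```

Running a quick SELECT line FROM json_lines_table LIMIT 10 before the CTAS is a cheap way to check whether the rows really contain whole records.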
