AWS Athena and working with JSON

Question

I have millions of files with the following (poor) JSON format:

{
  "3000105002": [
    {
      "pool_id": "97808",
      "pool_name": "WILDCAT (DO NOT USE)",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    },
    {
      "pool_id": "96838",
      "pool_name": "DRY & ABANDONED",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    }
  ]
}

I tried to write an Athena DDL to accommodate this type of structure (especially the api field):

CREATE EXTERNAL TABLE wp_info (
         api:array < struct < pool_id:string,
         pool_name:string,
         status:string,
         bhl:string,
         acreage:string>>)
LOCATION 's3://foo/'

Attempting to create the table with this statement raises the following error:

Your query has the following error(s):

FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type

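What is a workable solution to this problem? Note that the api string is different for each of the millions of files. The api key does not actually appear in any of the files, so I am hoping Athena has a way to accommodate just the string-typed value for this data.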

Answer 1

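If you have no control over the JSON format you receive, and there is no streaming service in between to transform the JSON into a simpler format, you can use regex functions to retrieve the relevant data you need.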

A simple way to do this is a Create-Table-As-Select (CTAS) query, which converts the data from its complex JSON format into a simpler table format.

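-- Convert the raw JSON text into a flat Parquet table with CTAS.
-- Assumption: json_lines_table exposes each raw line of the files as a string column named line.
CREATE TABLE new_table
WITH (
      external_location = 's3://path/to/ctas_partitioned/',
      format = 'Parquet',
      parquet_compression = 'SNAPPY')
AS SELECT
  -- regexp_extract returns capture group 1, or NULL when the pattern does not match
  regexp_extract(line, '"pool_id": "(\d+)"', 1) AS pool_id,
  -- [^"]* captures the whole name; without the quantifier only one-character names would match
  regexp_extract(line, '"pool_name": "([^"]*)"', 1) AS pool_name,
  ...
FROM json_lines_table;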
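
The extraction patterns can be sanity-checked outside Athena before running the query over millions of files. The following is a minimal sketch in Python, not part of the original answer: the sample line and the helper name extract are made up for illustration, and it assumes that Python's re module and Athena's regex engine treat these simple patterns the same way.

```python
import re

# Hypothetical sample: one record as it might appear on a single line
# of json_lines_table (the real layout of the files is an assumption).
line = (
    '{"pool_id": "97808", "pool_name": "WILDCAT (DO NOT USE)", '
    '"status": "Zone Permanently Plugged"}'
)

# Patterns mirroring the ones passed to regexp_extract in the CTAS query.
POOL_ID = re.compile(r'"pool_id": "(\d+)"')
POOL_NAME = re.compile(r'"pool_name": "([^"]*)"')

# The same pattern without a quantifier accepts only one-character names.
POOL_NAME_NO_QUANTIFIER = re.compile(r'"pool_name": "([^"])"')


def extract(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract(POOL_ID, line))
    print(extract(POOL_NAME, line))
    print(extract(POOL_NAME_NO_QUANTIFIER, line))
```

The last call returns None for this sample, which is why the character class for pool_name needs a quantifier such as * to capture the full name.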

Because the new table is stored in Parquet format, queries against it will perform better.

Note that you can also update the table by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partitioning scheme.
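
For the case this answer sets aside, where a conversion step can be placed in front of Athena, the transformation into a simpler format could look roughly like the sketch below. It is written in Python as an illustration only; the function names flatten_api_document and to_json_lines are hypothetical, and it assumes each file holds exactly the structure shown in the question (one or more api keys, each mapping to a list of pool objects). It moves the varying api key into an ordinary field and emits one JSON object per line, a layout that Athena's JSON SerDes can read with a plain column list.

```python
import json

# Sample document copied from the question.
SAMPLE = """
{
  "3000105002": [
    {"pool_id": "97808", "pool_name": "WILDCAT (DO NOT USE)",
     "status": "Zone Permanently Plugged",
     "bhl": "D-12-10N-05E 902 FWL 902 FWL", "acreage": ""},
    {"pool_id": "96838", "pool_name": "DRY & ABANDONED",
     "status": "Zone Permanently Plugged",
     "bhl": "D-12-10N-05E 902 FWL 902 FWL", "acreage": ""}
  ]
}
"""


def flatten_api_document(raw_text):
    """Turn {"<api>": [{...}, ...]} into one flat dict per pool entry."""
    document = json.loads(raw_text)
    rows = []
    for api, pools in document.items():
        for pool in pools:
            # The varying top-level key becomes a regular "api" field.
            rows.append({"api": api, **pool})
    return rows


def to_json_lines(rows):
    """Serialize the rows as JSON Lines: one compact object per line."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    print(to_json_lines(flatten_api_document(SAMPLE)))
```

With the data in that shape, the table definition would only need string columns such as api, pool_id, pool_name, status, bhl and acreage, and no regex extraction.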

Solution 1
1 accepted 2020-05-22 11:31:12
