Querying JSON array data with AWS Athena

Question

I'm not able to query S3 files with AWS Athena. The files contain regular JSON arrays like this:

[
  {
    "DataInvio": "2020-02-06T13:37:00+00:00",
    "DataLettura": "2020-02-06T13:35:50+00:00",
    "FlagDownloaded": 0,
    "GUID": "f257c9c0-b7e1-4663-8d6d-97e652b27c10",
    "IMEI": "866100000062167",
    "Id": 0,
    "IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
    "IdTagLocal": 0,
    "SerialNumber": "142707160028BJZZZZ",
    "Tag": "E200001697080089188056D2",
    "Tipo": "B",
    "TipoEvento": "L",
    "TipoSegnalazione": 0,
    "TipoTag": "C",
    "UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
  },
  {
    "DataInvio": "2020-02-06T13:37:00+00:00",
    "DataLettura": "2020-02-06T13:35:50+00:00",
    "FlagDownloaded": 0,
    "GUID": "e531272e-465c-4294-950d-95a683ff8e3b",
    "IMEI": "866100000062167",
    "Id": 0,
    "IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
    "IdTagLocal": 0,
    "SerialNumber": "142707160028BJZZZZ",
    "Tag": "E200341201321E0000A946D2",
    "Tipo": "B",
    "TipoEvento": "L",
    "TipoSegnalazione": 0,
    "TipoTag": "C",
    "UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
  }
]

A simple query, select * from mytable, returns empty rows if the table was created this way:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  `IdSessione` string,
  `DataLettura` date,
  `GUID` string,
  `DataInvio` date 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')

Or it gives me the error HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Missing value at 1 [character 2 line 1] if the table was created with:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable(
  `IdSessione` string,
  `DataLettura` date,
  `GUID` string,
  `DataInvio` date 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')

If I modify the content of the file like this (one JSON object per line, without trailing commas), the query returns results:

{ "DataInvio": "2020-02-06T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
{ "DataInvio": "2020-02-07T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}

How can I query JSON array structures directly?

Answer 1

The Athena best practices recommend one JSON record per line:

Make sure that each JSON-encoded record is represented on a separate line.

This has been asked a few times, and I don't think anyone has made it work with an array of JSON objects.
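Since the one-record-per-line layout is what worked in the question, one option is to rewrite each file before Athena reads it. The following is a minimal sketch in Python, not something from the Athena documentation: the function name json_array_to_lines is made up for illustration, and it assumes the whole array fits in memory. Downloading from and uploading to S3 is left out.

```python
import json


def json_array_to_lines(array_text: str) -> str:
    """Turn the text of a JSON array into one JSON object per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without an indent never emits newlines, so every
    # record ends up on exactly one line.
    return "".join(json.dumps(record) + "\n" for record in records)


if __name__ == "__main__":
    # Two records trimmed down from the sample in the question.
    sample = """
    [
      {"GUID": "f257c9c0-b7e1-4663-8d6d-97e652b27c10", "Tipo": "B"},
      {"GUID": "e531272e-465c-4294-950d-95a683ff8e3b", "Tipo": "B"}
    ]
    """
    print(json_array_to_lines(sample), end="")
```

The output has no enclosing brackets and no commas between records, which matches the modified file that returned results in the question.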

Answer 2

This is related to the formatting of the JSON objects. The resolution of these issues is also described here: https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/

Apart from this, if you are using AWS Glue to crawl these files, make sure the classification of the table in the Data Catalog is not "UNKNOWN".
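To check the formatting of a file locally before querying it, a small script can report which lines are not a complete JSON object on their own. This is only a sketch: find_bad_lines is a hypothetical helper, and skipping blank lines is a choice made here, not a statement about how Athena treats them.

```python
import json
from typing import List


def find_bad_lines(text: str) -> List[int]:
    """Return 1-based numbers of lines that are not a standalone JSON object."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(value, dict):
            bad.append(number)
    return bad


if __name__ == "__main__":
    pretty_printed = '[\n  {\n    "Id": 0\n  }\n]'
    print(find_bad_lines(pretty_printed))
```

For a pretty-printed array like the one in the question, every line is reported, because no single line holds a whole object.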
