
Multi-line JSON file querying in Hive

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.

I have an S3 bucket with multi-line, indented .json files (I don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).

  1. Is there a SerDe format out there that is able to parse multi-line indented .json files?
  2. If there isn't a SerDe format to do this:
    • Is there a best practice for dealing with files like this?
      • Should I plan on flattening these records out using a different tool like Python?
    • Is there a standard way of writing custom SerDe formats, so I can write one myself?

Example file body:

[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]

There is unfortunately no serde that supports multiline JSON content. There is the specialized CloudTrail serde that supports a format similar to yours, but it's hard-coded for the CloudTrail JSON format only. It does at least show that this is theoretically possible. Currently there is no way to write your own serdes to use with Athena, though.

You won't be able to consume these files with Athena as they are. You will have to use EMR, Glue, or some other tool to reformat them into JSON stream files (one record per line) first.
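Since the question mentions Python as an option, here is a minimal sketch of that reformatting step. It assumes each file holds a single JSON array of records and is small enough to load into memory; the names `json_array_to_lines` and `convert_file` are made up for illustration, and reading from or writing back to S3 is left out.

```python
import json


def json_array_to_lines(text: str) -> str:
    """Turn a JSON document containing an array of records into
    newline-delimited JSON: one compact record per line."""
    records = json.loads(text)
    # Assumption: a file that holds a single object is treated as one record.
    if not isinstance(records, list):
        records = [records]
    # Compact separators guarantee that no record spans more than one line.
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    )


def convert_file(src_path: str, dst_path: str) -> None:
    """Read an indented multi-line JSON file and write the one-record-per-line
    version to dst_path (both paths are local files in this sketch)."""
    with open(src_path, encoding="utf-8") as src:
        converted = json_array_to_lines(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(converted)
```

The converted files can then be queried with one of the usual one-record-per-line JSON serdes. For files too large to hold in memory, a streaming parser or a Glue/EMR job would be a better fit than this approach.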

