Flattening JSON with arrays using an AWS Glue crawler / classifier / ETL job

Question

I'm crawling the following JSON file (it is valid JSON) from an S3 data lake. It contains two fields (device, timestamp) and an array of objects called "data". Each object in the data array differs from the others.

{
  "device": "0013374838793C8",
  "timestamp": "2019-03-04T14:44:39Z",
  "data": [
    { "eparke_status": "09" },
    { "eparke_x": "FFF588" },
    { "eparke_y": "000352" },
    { "eparke_z": "000ACC" },
    { "eparke_temp": "14.00" },
    { "eparke_voltage": "4.17" }
  ]
}

Unfortunately, when I crawl it with an AWS Glue crawler, the schema is not inferred properly, and what I get in Athena is not what I expect.

The following listing shows one row of data from AWS Athena.

1   0013374838793C8 2019-03-05T13:11:41Z    [{eparke_status=0B, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=FFF6D4, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=000133, eparke_z=null, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=000DA3, eparke_temp=null, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=14.00, eparke_voltage=null}, {eparke_status=null, eparke_x=null, eparke_y=null, eparke_z=null, eparke_temp=null, eparke_voltage=4.17}]

As you can see, the schema for each object inside the array is discovered "wrongly". Each object in the data column contains ALL of the array objects' fields, most of which are set to null, which is understandable because they are not present in that object. The discovered schema is not what I'm looking for.
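
The null-padded structs in the Athena row can be reproduced with a small sketch. This is not Glue's actual inference code, only an illustration under the assumption that an array column gets one shared struct type built from the union of all keys seen in its elements. The helper names `union_schema` and `pad_to_schema` are hypothetical.

```python
import json

# Sample document from the question.
doc = json.loads("""
{
  "device": "0013374838793C8",
  "timestamp": "2019-03-04T14:44:39Z",
  "data": [
    { "eparke_status": "09" },
    { "eparke_x": "FFF588" },
    { "eparke_y": "000352" },
    { "eparke_z": "000ACC" },
    { "eparke_temp": "14.00" },
    { "eparke_voltage": "4.17" }
  ]
}
""")


def union_schema(elements):
    """Collect every key seen in any element, in first-seen order."""
    keys = []
    for element in elements:
        for key in element:
            if key not in keys:
                keys.append(key)
    return keys


def pad_to_schema(elements, keys):
    """Give every element the full key set, filling absent keys with None."""
    return [{key: element.get(key) for key in keys} for element in elements]


keys = union_schema(doc["data"])
padded = pad_to_schema(doc["data"], keys)

# Each padded element carries all six fields, five of them None.
for element in padded:
    print(element)
```
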

Expectations

The following listing shows the expected form of a table row after crawling with AWS Glue.

1   0013374838793C8 2019-03-05T13:11:41Z    eparke_status=0B eparke_x=FFF6D4 eparke_y=000133 eparke_z=000DA3 eparke_temp=14.00 eparke_voltage=4.17
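
For reference, the expected row corresponds to merging the single-key objects of the array into the top-level record. The following is a minimal sketch in plain Python, assuming every key appears at most once per document; `flatten_record` is a hypothetical helper, not part of any AWS API.

```python
def flatten_record(record, array_field="data"):
    """Merge the objects in record[array_field] into one flat dict.

    Top-level scalar fields are kept as-is. A key that would overwrite
    an existing field raises ValueError instead of silently winning.
    """
    flat = {k: v for k, v in record.items() if k != array_field}
    for element in record.get(array_field, []):
        for key, value in element.items():
            if key in flat:
                raise ValueError(f"duplicate field: {key}")
            flat[key] = value
    return flat


# Sample document from the question.
record = {
    "device": "0013374838793C8",
    "timestamp": "2019-03-04T14:44:39Z",
    "data": [
        {"eparke_status": "09"},
        {"eparke_x": "FFF588"},
        {"eparke_y": "000352"},
        {"eparke_z": "000ACC"},
        {"eparke_temp": "14.00"},
        {"eparke_voltage": "4.17"},
    ],
}

flat = flatten_record(record)
print(flat)
```

The result has eight top-level columns (device, timestamp, and the six eparke_* fields) and no array.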

What have I tried so far?

AWS Glue classifiers: to force the schema, I tried to use custom classifiers with these JSON paths:

$.device $.timestamp $.eparke_status $.eparke_x $.eparke_y $.eparke_z $.eparke_temp $.eparke_voltage

and

$.device $.timestamp $.data[0].eparke_status $.data[1].eparke_x $.data[2].eparke_y $.data[3].eparke_z $.data[4].eparke_temp $.data[5].eparke_voltage

Still, the final schema looks the same: all objects are packed inside each column.

Any ideas on how to address this issue? I'm also trying to configure an ETL job with a custom script, but have failed so far.

Answer 1

One thing I noticed is that once a crawler has run, the initially inferred schema and selected classifiers tend not to change on a new run. I think it is safer to duplicate the crawler and delete any previously created tables while experimenting.

I am not sure you can concatenate multiple root expressions in the JSON classifier expression. The documentation says that for a JSON classifier you just need to provide THE path to the node, in each line, that will be considered the actual JSON to infer the schema from.

To use each element of the array to infer the schema, you would have to use $.data[*]. But that would mean you would lose device and timestamp.
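
A small sketch of why that path drops the top-level fields. Plain Python indexing stands in for JSONPath evaluation here, and the snippet makes no claim about how the Glue classifier is implemented; it only shows what a selection of the array elements contains.

```python
# Trimmed version of the document from the question.
doc = {
    "device": "0013374838793C8",
    "timestamp": "2019-03-04T14:44:39Z",
    "data": [
        {"eparke_status": "09"},
        {"eparke_x": "FFF588"},
        {"eparke_temp": "14.00"},
    ],
}

# Stand-in for selecting $.data[*]: each array element becomes its own record.
selected = list(doc["data"])

# Only keys from inside the array survive; device and timestamp live on the
# parent object, so no selected record carries them.
fields_seen = sorted({key for element in selected for key in element})
print(fields_seen)
```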

You cannot do this with a crawler alone. My recommendation is to crawl with no custom classifier and then UNNEST the data from the array structure using an Athena query (https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html). Load the result into some datastore if needed. For S3, look at CTAS as an option. You might also be able to configure this as an ETL job.
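
To make the unnest-then-regroup idea concrete, here is a sketch in plain Python rather than Athena SQL. It only emulates the shape of the transformation: first one output row per array entry, then one row per (device, timestamp) pair. The function names `unnest` and `pivot` are hypothetical, and the code assumes each (device, timestamp) pair identifies one source document.

```python
from collections import defaultdict


def unnest(rows, array_field="data"):
    """Yield one (device, timestamp, key, value) tuple per array entry."""
    for row in rows:
        for element in row[array_field]:
            for key, value in element.items():
                yield row["device"], row["timestamp"], key, value


def pivot(long_rows):
    """Regroup the long-format tuples into one flat dict per (device, timestamp)."""
    grouped = defaultdict(dict)
    for device, timestamp, key, value in long_rows:
        grouped[(device, timestamp)][key] = value
    return [
        {"device": device, "timestamp": timestamp, **fields}
        for (device, timestamp), fields in grouped.items()
    ]


# Sample document from the question, as it would appear in one table row.
rows = [
    {
        "device": "0013374838793C8",
        "timestamp": "2019-03-04T14:44:39Z",
        "data": [
            {"eparke_status": "09"},
            {"eparke_x": "FFF588"},
            {"eparke_y": "000352"},
            {"eparke_z": "000ACC"},
            {"eparke_temp": "14.00"},
            {"eparke_voltage": "4.17"},
        ],
    }
]

long_rows = list(unnest(rows))
flat_rows = pivot(long_rows)
print(flat_rows[0])
```

The intermediate `long_rows` list has six entries for the sample document, one per array element; `pivot` collapses them back into a single wide row.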

Solution 1
1 2019-05-30 17:44:34
