

How to create an AWS Athena table via Glue crawler when the S3 data store has both JSON and .gz compressed files?

I have two problems in my intended solution:

1. My S3 store structure is as follows:

mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz

All JSON files have the same schema, and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.

I have already tried with just one file format: if the files are all .json or all .gz, the crawler works perfectly, but I am looking for a solution that automates processing of both file types together. I am open to writing a custom script or using an out-of-the-box solution, but I need pointers on where to start.

2. The second issue is that my JSON data has a field (column) which the crawler interprets as struct data, but I want to make that field's type string. The reason is that if the type remains struct, the date/hour partitions get a mismatch error, since the struct data obviously does not have the same internal schema across files. I have tried making a custom classifier, but there are no options there for specifying data types.

I would suggest skipping the crawler altogether. In my experience, Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially for adding partitions, but it's much less painful than trying to make a crawler do what you want it to do.
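A minimal sketch of that Glue API route with boto3, assuming a hypothetical database, table, bucket, and column names (`id`, `payload`) — substitute your own. Note that `TextInputFormat` decompresses `.gz` files transparently, so one table definition covers both plain and gzipped JSON, and the troublesome struct field is simply declared as `string`:

```python
def build_table_input(table_name="mainfolder_events"):
    """Build the TableInput structure for glue.create_table()."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        # date/hour come from the folder names, so they are partition keys,
        # not regular columns
        "PartitionKeys": [
            {"Name": "date", "Type": "string"},
            {"Name": "hour", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Location": "s3://my-bucket/mainfolder/",  # hypothetical bucket
            # TextInputFormat handles .gz decompression transparently,
            # so .json and .json.gz files can live under the same table
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
            "Columns": [
                {"Name": "id", "Type": "string"},       # hypothetical column
                {"Name": "payload", "Type": "string"},  # the struct-like field, declared as string
            ],
        },
    }

def create_table(database="my_database"):
    import boto3  # imported here so the builder above needs no AWS access
    boto3.client("glue").create_table(
        DatabaseName=database, TableInput=build_table_input()
    )
```

Partitions are added the same way with `glue.create_partition()` (or `batch_create_partition()`), passing a `StorageDescriptor` whose `Location` points at the specific `date=.../hour=...` prefix.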

You can of course also create the table from Athena; that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions through SQL in Athena is also less verbose, but slower.
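A hedged sketch of the Athena route: build the DDL as plain SQL strings and submit them with `start_query_execution()`. The database, table, bucket, and columns are assumptions; note that `date` is a reserved word in Athena DDL, so it must be backquoted:

```python
def create_table_ddl(database="my_database", table="mainfolder_events"):
    # `date` is a reserved word in Athena DDL, hence the backquotes
    return f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (
  id string,
  payload string
)
PARTITIONED BY (`date` string, hour string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/mainfolder/'
"""

def add_partition_ddl(database, table, date, hour):
    return (
        f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS "
        f"PARTITION (`date` = '{date}', hour = '{hour}')"
    )

def run_ddl(sql, results="s3://my-bucket/athena-results/"):
    import boto3  # lazy import: the DDL builders above need no AWS access
    boto3.client("athena").start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": results},
    )
```

Since the layout already follows the Hive `key=value` convention, running `MSCK REPAIR TABLE` once is an alternative to adding each partition individually, though it gets slow as the number of partitions grows.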

A crawler will not take compressed and uncompressed data together, so it will not work out of the box. It is better to write a Spark job in Glue and use spark.read().

