

How to create an AWS Athena table via Glue crawler when the S3 data store has both JSON and .gz compressed files?

I have two problems in my intended solution:

1. My S3 store structure is as follows:

mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz

All JSON files have the same schema, and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.

I have already tried with just one file format: if the files are all .json or all .gz, the crawler works perfectly, but I am looking for a solution that automates processing of both file types together. I am open to writing a custom script or using an out-of-the-box solution, but I need pointers on where to start.

2. The second issue is that my JSON data has a field (column) which the crawler interprets as struct data, but I want to make that field's type string. The reason is that if the type remains struct, the date/hour partitions get a mismatch error, since the struct data obviously does not have the same internal schema across files. I have tried making a custom classifier, but there are no options there for specifying data types.

I would suggest skipping the crawler altogether. In my experience, Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially for adding partitions, but it's much less painful than trying to make a crawler do what you want it to do.
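A minimal sketch of that Glue API route with boto3, assuming a hypothetical database, table, bucket, and column names (`id`, `payload`) — substitute your own. Note that `TextInputFormat` decompresses `.gz` files transparently, so one table definition covers both plain and gzipped JSON, and the troublesome struct field is simply declared as `string`:

```python
def build_table_input(table_name="mainfolder_events"):
    """Build the TableInput structure for glue.create_table()."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        # date/hour come from the folder names, so they are partition keys,
        # not regular columns
        "PartitionKeys": [
            {"Name": "date", "Type": "string"},
            {"Name": "hour", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Location": "s3://my-bucket/mainfolder/",  # hypothetical bucket
            # TextInputFormat handles .gz decompression transparently,
            # so .json and .json.gz files can live under the same table
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
            "Columns": [
                {"Name": "id", "Type": "string"},       # hypothetical column
                {"Name": "payload", "Type": "string"},  # the struct-like field, declared as string
            ],
        },
    }

def create_table(database="my_database"):
    import boto3  # imported here so the builder above needs no AWS access
    boto3.client("glue").create_table(
        DatabaseName=database, TableInput=build_table_input()
    )
```

Partitions are added the same way with `glue.create_partition()` (or `batch_create_partition()`), passing a `StorageDescriptor` whose `Location` points at the specific `date=.../hour=...` prefix.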

You can of course also create the table from Athena; that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions through SQL in Athena is also less verbose, but slower.
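A hedged sketch of the Athena route: build the DDL as plain SQL strings and submit them with `start_query_execution()`. The database, table, bucket, and columns are assumptions; note that `date` is a reserved word in Athena DDL, so it must be backquoted:

```python
def create_table_ddl(database="my_database", table="mainfolder_events"):
    # `date` is a reserved word in Athena DDL, hence the backquotes
    return f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (
  id string,
  payload string
)
PARTITIONED BY (`date` string, hour string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/mainfolder/'
"""

def add_partition_ddl(database, table, date, hour):
    return (
        f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS "
        f"PARTITION (`date` = '{date}', hour = '{hour}')"
    )

def run_ddl(sql, results="s3://my-bucket/athena-results/"):
    import boto3  # lazy import: the DDL builders above need no AWS access
    boto3.client("athena").start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": results},
    )
```

Since the layout already follows the Hive `key=value` convention, running `MSCK REPAIR TABLE` once is an alternative to adding each partition individually, though it gets slow as the number of partitions grows.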

A crawler will not take compressed and uncompressed data together, so it will not work out of the box. It is better to write a Spark job in Glue and use spark.read().

