简体   繁体   中英

AWS Glue Crawler Creates Partition and File Tables

I have a pretty basic s3 setup that I would like to query against using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...

I then setup an AWS Glue Crawler to crawl s3://bucket/data . The schema in all files is identical. I would expect that I would get one database table, with partitions on the year, month, day, etc.

What I get instead are tens of thousands of tables. There is a table for each file, and a table for each parent partition as well. So far as I can tell, separate tables were created for each file/folder, without a single overarching one where I can query across a large date range.

I followed instructions https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning such that I don't get this huge, mostly worthless dump of data.

Most times files with just one record create separate tables. I tried files with greater than 2 records and was able to group everything under one table with respective partitions.

How does your json files look like?

Glue Crawler leaves a lot to be desired. It's promises to solve a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (eg year=2019/month=02/file.json ) it will more often than not mess up. It's especially frustrating when the data is produced by other AWS products, like Kinesis Firehose, which it looks like your data could be.

Depending on how much data you have I might start by just creating an unpartitioned Athena table that pointed to the root of the structure. It's only once your data grows beyond multiple gigabytes or thousands of files that partitioning becomes important.

Another strategy you could employ is to add a Lambda function that gets triggered by an S3 notification whenever a new object lands in your bucket. The function could look at the key and figure out which partition it belongs to and use the Glue API to add that partition to the table. Adding a partition that already exists will return an error from the API, but as long as your function catches it and ignores it you will be fine.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM