How can I exclude specific folders with a specific year for the crawler in AWS Glue?

I do not quite understand how exclude patterns work for an AWS Glue crawler. I have the following folder structure:

s3://hourly/20170101-20170201/
s3://hourly/20180101-20180201/
s3://hourly/20180201-20180301/
s3://hourly/20190701-20190801/
s3://hourly/20190801-20190901/
s3://hourly/20190901-20191001/
s3://hourly/20200101-20200201/
s3://hourly/20200201-20200301/

Now I want to use exclude patterns to skip every folder that is not from 2018, for example.

According to the docs:

https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude

As I understand it, I can add a pattern such as *2020* to the exclude patterns to drop the folders from the year 2020. So I tried adding all the other years as exclude patterns: *2017*, *2019* and *2020*.

[Screenshot: exclude pattern options]

My results, however, still contain the other years, so apparently this did not work. I also tried *{2017,2019,2020}*, which did not work either.

Can someone tell me how to write the exclude patterns so that only folders from the year 2018 are crawled? That is, only these two, ignoring the rest:

s3://hourly/20180101-20180201/
s3://hourly/20180201-20180301/

You need to specify an S3 key prefix pattern, which means the pattern has to match the object key from its beginning. The pattern is a glob, not a regular expression, so it cannot simply match any substring of the object key. Per the AWS documentation, a single * also does not cross folder boundaries, which is why *2020* never matches a key that starts with another folder name.

Assuming the S3 bucket name is not actually hourly and the Glue crawler's crawl path is set to the root of the bucket, the exclude pattern should start with a wildcard (*) followed by a slash to match the hourly prefix.

s3://<bucket_name>/hourly/20170101-20170201/
s3://<bucket_name>/hourly/20180101-20180201/
s3://<bucket_name>/hourly/20180201-20180301/
s3://<bucket_name>/hourly/20190701-20190801/
s3://<bucket_name>/hourly/20190801-20190901/
s3://<bucket_name>/hourly/20190901-20191001/
s3://<bucket_name>/hourly/20200101-20200201/
s3://<bucket_name>/hourly/20200201-20200301/

Exclude Patterns:

*/2017*  <-- matches key 'hourly/20170101-20170201/'
*/2019*
*/2020*
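
To see why */2017* matches while *2017* does not, here is a minimal Python sketch of the glob rules as the AWS documentation describes them (* stays within one folder level, ** crosses levels). It is a simplified model, not Glue's actual matcher: the helper names glob_to_regex and is_excluded are made up for illustration, brackets and braces are not handled, and stripping the trailing slash before matching is an assumption.

```python
import re

def glob_to_regex(pattern):
    """Translate a simplified glob into a compiled regex.

    Assumed rules (a model of the documented behaviour, not Glue's code):
    '**' matches anything including '/', '*' matches anything except '/',
    '?' matches one character except '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def is_excluded(relative_key, patterns):
    """True if the key (relative to the crawl path) fully matches any pattern."""
    key = relative_key.rstrip("/")  # assumption: folders compared without trailing slash
    return any(glob_to_regex(p).fullmatch(key) for p in patterns)

# Keys relative to a crawl path at the bucket root.
keys = [
    "hourly/20170101-20170201/",
    "hourly/20180101-20180201/",
    "hourly/20180201-20180301/",
    "hourly/20190701-20190801/",
    "hourly/20190801-20190901/",
    "hourly/20190901-20191001/",
    "hourly/20200101-20200201/",
    "hourly/20200201-20200301/",
]

exclude_patterns = ["*/2017*", "*/2019*", "*/2020*"]
kept = [k for k in keys if not is_excluded(k, exclude_patterns)]
print(kept)
# ['hourly/20180101-20180201/', 'hourly/20180201-20180301/']
```

Under this model, *2017* fails because the leading * cannot consume the slash after hourly, whereas */2017* supplies that slash explicitly.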

The exclude pattern is relative to the Glue crawler's crawl path.
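
If the crawler is defined in code rather than in the console, the same patterns go into the Exclusions list of the S3 target. The sketch below builds that structure and passes it to boto3's create_crawler. The bucket name, crawler name, role and database name are hypothetical placeholders, and build_s3_targets and create_hourly_crawler are illustrative helpers, not part of any library.

```python
def build_s3_targets(crawl_path, exclusions):
    """Build the Targets argument for a Glue crawler with one S3 target."""
    return {
        "S3Targets": [
            {
                "Path": crawl_path,              # the crawl (include) path
                "Exclusions": list(exclusions),  # globs relative to that path
            }
        ]
    }

def create_hourly_crawler(targets):
    """Create the crawler; needs boto3 and AWS credentials, so it is not called here."""
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(
        Name="hourly-2018-only",        # hypothetical crawler name
        Role="MyGlueCrawlerRole",       # hypothetical IAM role
        DatabaseName="hourly_db",       # hypothetical catalog database
        Targets=targets,
    )

# Crawl path at the bucket root, so each pattern starts with '*/' to cover 'hourly/'.
targets = build_s3_targets(
    "s3://example-bucket/",             # hypothetical bucket
    ["*/2017*", "*/2019*", "*/2020*"],
)
print(targets)
```

If the crawl path were s3://example-bucket/hourly/ instead, the patterns would be written relative to that folder and would no longer need the leading */ part.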
