
AWS Glue Crawler glob Exclude Pattern functionality

We need to ignore a few paths while crawling through a specific path. Below are the details:

Include Path: s3://dev-bronze/api/sp/reports/xyz/
Exclude Path: brand=abc/client=xxx/**

Full path: s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/

We want to ignore a few clients' data, so I am using the glob above, but it doesn't seem to work. Any help would be highly appreciated.

The exclude patterns brand=abc/client=xxx/** and brand=abc/client=xxx** (note the missing /) behave differently. Both are evaluated relative to the include path.

Exclude pattern brand=abc/client=xxx/** matches:

s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/<subfolder1>/file1.txt
s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/<subfolder2>/file2.txt

This pattern matches objects in all subfolders of brand=abc/client=xxx/.

Exclude pattern brand=abc/client=xxx** matches:

s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/file1.txt
s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/file2.txt

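This pattern matches all objects in brand=abc/client=xxx/. Because there is no / before the **, it also matches any sibling prefix whose name starts with client=xxx (for example, client=xxx2/), so check that no other client name shares that prefix.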

If you want to exclude the files in brand=abc/client=xxx/, use the exclude pattern brand=abc/client=xxx**.
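If the crawler is managed from code rather than the console, the same include path and exclude pattern can be set through the crawler's S3 target. The following is a minimal sketch, not a definitive setup: the helper build_s3_target and the crawler name "my-crawler" are hypothetical, and the commented-out call assumes boto3 is installed, AWS credentials are configured, and the crawler already exists.

```python
def build_s3_target(include_path, exclude_patterns):
    """Build one S3 target entry for a Glue crawler.

    'Path' is the include path; 'Exclusions' holds glob patterns
    that are evaluated relative to that include path.
    """
    return {"Path": include_path, "Exclusions": list(exclude_patterns)}


# Values taken from the question; the pattern is the one suggested above.
target = build_s3_target(
    "s3://dev-bronze/api/sp/reports/xyz/",
    ["brand=abc/client=xxx**"],
)

# Applying it to an existing crawler (placeholder name, needs AWS access):
#   import boto3
#   glue = boto3.client("glue")
#   glue.update_crawler(Name="my-crawler", Targets={"S3Targets": [target]})
```

Further clients can be excluded by adding more patterns to the list, one per client.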

Reference: Crawler Properties > Include and Exclude Patterns (AWS)
