
AWS Glue crawler exclude patterns not working

I have the following AWS S3 bucket and folders:

I want to crawl the three Parquet files under folder1 and folder2 (one in folder1 and two in folder2), which sit under tfsdl_apac_test/rz_test. folder1 and folder2 each also contain a _delta_log folder and a _symlink_format_manifest folder. For testing purposes these hold the same files in both folders; their contents are shown here for folder1.

[Screenshots of the S3 folder contents]

These are the crawler settings: [screenshot]

However, the crawler produces this result in the generated tables: every file from every subfolder under rz_test is picked up, 13 files in total.

[Screenshot of the generated tables]

The official documentation is a little misleading here. The exclude patterns should be:

    **/_delta_log/**
    **/_symlink_format_manifest/**
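For anyone configuring the crawler from code rather than the console, here is a minimal sketch of how these two patterns could be attached to an S3 target with boto3. The bucket name `example-bucket` and the crawler name are hypothetical placeholders, and `build_s3_target` and `apply_exclusions` are illustrative helper names, not part of any library. The sketch assumes the crawler already exists and that AWS credentials are configured.

```python
# Sketch: attach the exclude patterns above to a crawler's S3 target.
# The bucket and crawler names used here are hypothetical placeholders.

EXCLUDE_PATTERNS = [
    "**/_delta_log/**",
    "**/_symlink_format_manifest/**",
]


def build_s3_target(path, exclusions):
    """Return an S3 target dict with the Path and Exclusions keys."""
    # Copy the list so later changes to the caller's list do not leak in.
    return {"Path": path, "Exclusions": list(exclusions)}


def apply_exclusions(crawler_name, s3_path, exclusions=EXCLUDE_PATTERNS):
    """Update an existing crawler so it skips the excluded folders."""
    import boto3  # imported lazily so build_s3_target works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        Targets={"S3Targets": [build_s3_target(s3_path, exclusions)]},
    )


if __name__ == "__main__":
    # Only builds and prints the target; it does not call AWS.
    target = build_s3_target(
        "s3://example-bucket/tfsdl_apac_test/rz_test/",  # hypothetical bucket
        EXCLUDE_PATTERNS,
    )
    print(target)
```

Calling `apply_exclusions("my-crawler", "s3://example-bucket/tfsdl_apac_test/rz_test/")` would then update the crawler, after which it needs to be run again for the change to take effect.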
