I have the following AWS S3 bucket and folders:
I want to crawl the 3 Parquet files under folder1 and folder2 (one in folder1 and two in folder2), which sit under tfsdl_apac_test/rz_test. For testing purposes, folder1 and folder2 each contain a _delta_log folder and a _symlink_format_manifest folder with the same files in both; the contents of _delta_log and _symlink_format_manifest shown here are those of folder1.
However, the tables generated by the crawler include all the files from all the subfolders under rz_test, 13 files in total, instead of only the 3 Parquet files.
The official documentation is a little misleading on this point. The exclude patterns should be:
**/_delta_log/**
**/_symlink_format_manifest/**
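If the crawler is created from code rather than the console, the two patterns above go into the `Exclusions` list of the S3 target. The following is only a minimal sketch using boto3's `create_crawler` call; the function names `build_s3_target` and `create_delta_crawler`, and whatever crawler name, IAM role, database, and S3 path you pass in, are hypothetical and not taken from the question.

```python
from typing import Dict, List

# Exclude patterns taken from the answer above.
DELTA_EXCLUSIONS: List[str] = [
    "**/_delta_log/**",
    "**/_symlink_format_manifest/**",
]


def build_s3_target(path: str, exclusions: List[str]) -> Dict[str, object]:
    """Build one entry for the crawler's Targets["S3Targets"] list."""
    # Copy the list so later changes to the caller's list do not leak in.
    return {"Path": path, "Exclusions": list(exclusions)}


def create_delta_crawler(name: str, role_arn: str, database: str, s3_path: str) -> None:
    """Create a Glue crawler that skips the two excluded folders.

    All four arguments are placeholders to be supplied by the caller.
    """
    import boto3  # imported lazily so build_s3_target works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [build_s3_target(s3_path, DELTA_EXCLUSIONS)]},
    )
```

A call might look like `create_delta_crawler("rz-test-crawler", "<role-arn>", "<database>", "s3://<bucket>/tfsdl_apac_test/rz_test/")`, with every value replaced by your own. The exclude patterns are evaluated relative to the include path, so the path should point at rz_test.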