
Adding only one of s3 partitioned files to AWS Glue

Original post: 2019-12-16 19:26:31 | Tags: python, database, amazon-web-services, amazon-s3, aws-glue

I'm having a slight issue running a crawler over my S3 buckets. My folders contain data that was dumped from Redshift and sliced into many different files. The files follow this naming convention:

dump_0000_part_00.gz, dump_0001_part_01.gz ....

However, when my crawler fetches the metadata in this folder, it creates a few hundred tables, treating each of these sliced files as its own table. Is there a way to tell the crawler to group all of these sliced files into ONE catalog table?

1 Answer

When configuring the crawler (or editing an existing one), go to the Output section, expand "Grouping behavior for S3 data (optional)", and select "Create a single schema for each S3 path".
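If you would rather set this from code than from the console, the same grouping option can, as far as I know, be set through the crawler's `Configuration` JSON, where it is called `CombineCompatibleSchemas`. The following is only a minimal sketch using boto3's `update_crawler`. The helper names and the crawler name in the usage note are made up for illustration and are not from the original answer.

```python
import json


def build_grouping_configuration() -> str:
    """Return the crawler Configuration JSON that asks Glue to combine
    compatible schemas under each S3 path into a single table."""
    config = {
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return json.dumps(config)


def apply_single_schema_grouping(glue_client, crawler_name: str) -> None:
    """Apply the grouping policy to an existing crawler.

    glue_client is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    Assumption: the crawler has no other custom Configuration you need to keep,
    because this call sends only the grouping settings.
    """
    glue_client.update_crawler(
        Name=crawler_name,
        Configuration=build_grouping_configuration(),
    )
```

Usage would look something like `apply_single_schema_grouping(boto3.client("glue"), "my-redshift-dump-crawler")`, where the crawler name is a placeholder. You would then re-run the crawler so the catalog is rebuilt. Tables created by earlier runs may need to be deleted by hand. The grouping only helps when the sliced files actually share a compatible schema.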
