AWS Glue: selecting dynamic files

Question

Does anyone know how to get a dynamic file from an S3 bucket? I set up a crawler on an S3 bucket; however, new files arrive each day with a YYYY-MM-DD-HH-MM-SS suffix.

When I read the table through the catalog, it reads all the files present in the directory. Is it possible to dynamically pick the latest three files for a given day and use them as a source?

Thanks!

Answer 1

You don't need to re-run the crawler if the files are located in the same place. For example, if your data folder is s3://bucket/data/<files>, you can add new files to it and run the ETL job; the new files will be picked up automatically.

However, if data arrives in new partitions (sub-folders) such as s3://bucket/data/<year>/<month>/<day>/<files>, then you need either to run a crawler or to execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register the new partitions in the Glue Catalog before starting the Glue ETL job.
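As a rough sketch of the Athena route, the repair statement could be submitted from Python with boto3's Athena client and its start_query_execution call. The database name, table name and results location below are placeholders, not values from the question, and the submit function needs boto3 plus AWS credentials to run. Note also that MSCK REPAIR TABLE discovers Hive-style folders (for example year=2018/month=09/day=30/), so this assumes the partitions are laid out that way.

```python
def build_repair_request(database, table, output_location):
    """Build the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": f"MSCK REPAIR TABLE {table}",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def register_partitions(database, table, output_location):
    """Submit the repair query and return its execution id.

    Requires boto3 and AWS credentials, so it is not called in this sketch.
    """
    import boto3  # imported lazily so the builder above works without boto3

    athena = boto3.client("athena")
    request = build_repair_request(database, table, output_location)
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder names (assumptions), shown only to illustrate the request shape.
request = build_repair_request(
    "my_database", "my_table", "s3://bucket/athena-results/"
)
print(request["QueryString"])
```

The query runs asynchronously, so a job that depends on the new partitions would probably want to wait for it to finish before reading the table.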

Once the data is loaded into a DynamicFrame or a Spark DataFrame, you can apply filters to keep only the data you need. If you still want to work with file names, you can add the file name as a column using Spark's input_file_name function and then filter on it:

from pyspark.sql.functions import col, input_file_name

# input_file_name() must be called; it yields each row's full source path.
filtered = (
    df.withColumn("filename", input_file_name())
    .where(col("filename") == "your-filename")
)
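For the "latest three files for a given day" part of the question, one possible approach is to pick the file names in plain Python first and then filter on them. The sketch below assumes each name ends with the YYYY-MM-DD-HH-MM-SS stamp, optionally followed by a single extension; the example paths and the helper names parse_stamp and latest_files are made up for illustration.

```python
import re
from datetime import date, datetime

# Assumed naming scheme: <anything>YYYY-MM-DD-HH-MM-SS[.ext]
_STAMP = re.compile(r"(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})(?:\.\w+)?$")


def parse_stamp(path):
    """Return the timestamp at the end of a file name, or None if absent."""
    match = _STAMP.search(path)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")
    except ValueError:  # digits present but not a valid date/time
        return None


def latest_files(paths, day, count=3):
    """Return up to `count` paths stamped on `day`, newest first."""
    stamped = [(parse_stamp(p), p) for p in paths]
    same_day = [(ts, p) for ts, p in stamped if ts is not None and ts.date() == day]
    same_day.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in same_day[:count]]


# Hypothetical file listing, e.g. collected from an S3 listing call.
paths = [
    "s3://bucket/data/events-2018-09-30-01-00-00.csv",
    "s3://bucket/data/events-2018-09-30-07-30-00.csv",
    "s3://bucket/data/events-2018-09-30-12-15-00.csv",
    "s3://bucket/data/events-2018-09-30-23-58-37.csv",
    "s3://bucket/data/events-2018-09-29-23-00-00.csv",
]
selected = latest_files(paths, date(2018, 9, 30))
print(selected)
```

The resulting list could then be used in the filter with col("filename").isin(selected) in place of the equality check on a single name.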

If you control how the files arrive, I would suggest putting them into partitions (sub-folders that indicate the date, i.e. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/) so that you can benefit from pushdown predicates in AWS Glue.
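To illustrate the pushdown idea, here is a minimal sketch that builds a predicate string for one day. It assumes the table is partitioned by string columns named year, month and day with zero-padded values; those column names, the helper day_predicate, and the database and table names in the comment are assumptions rather than details from the question.

```python
from datetime import date


def day_predicate(day):
    """Build a partition predicate selecting a single calendar day."""
    return (
        f"year == '{day.year:04d}' and "
        f"month == '{day.month:02d}' and "
        f"day == '{day.day:02d}'"
    )


predicate = day_predicate(date(2018, 9, 30))
print(predicate)

# Inside a Glue job (where glueContext exists) the string would be passed as:
#   glueContext.create_dynamic_frame.from_catalog(
#       database="my_database",      # placeholder name
#       table_name="my_table",       # placeholder name
#       push_down_predicate=predicate,
#   )
```

With a predicate like this, Glue only lists and reads the matching partitions instead of scanning every file under the table location.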

Answer 1 · 0 · 2018-09-30 23:58:37
