AWS Glue - Pick Dynamic File

Question

Does anyone know how to get a dynamic file from a S3 bucket? I setup a crawler on a S3 bucket however, my issue is, there will be new files coming each day with YYYY-MM-DD-HH-MM-SS suffix.

When I read the table through the catalog, it reads all the files present in the directory? Is it possible to dynamically pick the latest three files for a given day and use it as a Source?

Thanks!

Answer 1

You don't need to re-run crawler if files are located in the same place. For example, if your data folder is s3://bucket/data/<files> then you can add new files to it and run ETL job - new files will be picked up automatically.

However, if data arrives in new partitions (sub-folders) like s3://bucket/data/<year>/<month>/<day>/<files> then you need either to run a crawler or execute MSCK REPAIR TABLE <catalog-table-name> in Athena to register new partitions in Glue Catalog before starting Glue ETL job.

When data is loaded into DynamicFrame or spark's DataFrame you can apply some filters to use needed data only. If you still want to work with file names then you can add it as a column using input_file_name spark function and then apply filtering:

from pyspark.sql.functions import col, input_file_name

df.withColumn("filename", input_file_name)
  .where(col("filename") == "your-filename")

If you control how files are coming I would suggest to put them into partitions (sub-folders that indicate date, ie. /data/<year>/<month>/<day>/ or just /data/<year-month-day>/ ) so that you could benefit from using pushdown predicates in AWS Glue

AWS Glue - Pick Dynamic File

Question

1 answers

solution1
0 2018-09-30 23:58:37

AWS Glue - Pick Dynamic File

Question

1 answers

solution1 0 2018-09-30 23:58:37

solution1
0 2018-09-30 23:58:37