
Is there a way to read the filename from an S3 bucket when running an AWS Glue ETL job, and to name the output file? Does PySpark provide a way to do it?

I am in the process of running an AWS Glue ETL job that reads JSON files named rawpart1.json and rawpart2.json from multiple S3 buckets. I need to validate fields from both files along with the filenames from both S3 buckets. Can I get the filenames in order to read and change them? After the ETL job runs, I also want to set the filename of the job's output in the S3 bucket. Currently I am getting run-15902070851728-part-r-00000 as the filename. Let me know if this can be done in PySpark. Thanks

You cannot control the output filename generated by Spark. However, if you need the filename in order to read a specific file, you can use boto3 to get the filename from the S3 bucket and then pass it to your ETL job to read that particular file.
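
For example, here is a minimal sketch of listing object keys with boto3 and reading a matched file with Spark. The bucket name, prefix, and the use of a plain SparkSession read (rather than a Glue DynamicFrame read) are assumptions for illustration:

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    s3 = boto3.client("s3")

    bucket = "my-raw-bucket"   # hypothetical bucket name
    prefix = "incoming/"       # hypothetical key prefix

    # List the object keys under the prefix and keep only the JSON files.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    json_keys = [k for k in keys if k.endswith(".json")]

    # Read each file individually so the source filename (the key) stays
    # available for validation alongside the file's fields.
    for key in json_keys:
        df = spark.read.json(f"s3://{bucket}/{key}")
        # ... validate the fields of this file against `key` here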

The output file name cannot be controlled because multiple executors are responsible for generating the output files. We can control the folder where the output data is written, but not the file names inside it.
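
For illustration, a sketch of writing a DynamicFrame to a chosen output folder; `glueContext`, `output_dyf` (the DynamicFrame produced by the job), and the S3 path are assumed placeholders from a typical Glue job script:

    # We choose the folder; Spark/Glue chooses the part-file names inside it,
    # e.g. run-...-part-r-00000.
    glueContext.write_dynamic_frame.from_options(
        frame=output_dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-output-bucket/processed/"},  # hypothetical folder
        format="json",
    )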

You can use the DynamicFrame repartition method to reduce the number of output partitions/files before you write out your frame. And although Spark cannot name your output file, as mentioned above, the file can still be renamed after it has been written to S3.
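
A sketch of the repartition step, continuing with the assumed `output_dyf` and output path from above:

    # Collapse the DynamicFrame to a single partition so the write produces
    # one part file, which is simpler to rename afterwards.
    single_partition_dyf = output_dyf.repartition(1)

    glueContext.write_dynamic_frame.from_options(
        frame=single_partition_dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-output-bucket/processed/"},
        format="json",
    )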

Please refer to this answer, which uses a Hadoop FileSystem object created from an S3 path, to modify the output filename. You would need to use boto3 to capture the input file name to replace {desired_name} in that answer.
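
Below is a hedged sketch of that approach: after the write, the generated part file is located through a Hadoop FileSystem object obtained via Spark's JVM gateway and renamed in place. `sc` is the job's SparkContext (e.g. sc = spark.sparkContext); the output folder and `desired_name` are placeholders, with `desired_name` built from the input key captured with boto3 as described above:

    # Hypothetical output folder and target name.
    output_dir = "s3://my-output-bucket/processed/"
    desired_name = "rawpart1-processed.json"

    # Get a Hadoop FileSystem for the S3 path through Spark's JVM gateway.
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create(output_dir),
        sc._jsc.hadoopConfiguration(),
    )

    # Locate the single part file Spark wrote and rename it in place.
    for status in fs.listStatus(hadoop.fs.Path(output_dir)):
        name = status.getPath().getName()
        if name.startswith("run-") or name.startswith("part-"):
            fs.rename(status.getPath(), hadoop.fs.Path(output_dir + desired_name))
            break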
