
Is there a way to read the filename from an S3 bucket when running an AWS Glue ETL job, and to name the output file? Does PySpark provide a way to do it?

I am in the process of running an AWS Glue ETL job that reads JSON files named rawpart1.json and rawpart2.json from multiple S3 buckets. I need to validate fields from both files along with the filenames from both S3 buckets. Can I get the filenames in order to read and change them? After the ETL job runs, I also want to set the filename of the job's output in the S3 bucket. Currently I am getting run-15902070851728-part-r-00000 as the filename. Let me know if this can be done in PySpark. Thanks

You cannot control the output filename generated by Spark. However, if you need the filename in order to read a specific file, you can use boto3 to get the filename from the S3 bucket and then pass it to your ETL job to read that particular file.
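
For example, here is a minimal sketch of listing object keys with boto3 and reading a matched file with Spark. The bucket name, prefix, and the use of a plain SparkSession read (rather than a Glue DynamicFrame read) are assumptions for illustration:

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    s3 = boto3.client("s3")

    bucket = "my-raw-bucket"   # hypothetical bucket name
    prefix = "incoming/"       # hypothetical key prefix

    # List the object keys under the prefix and keep only the JSON files.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    json_keys = [k for k in keys if k.endswith(".json")]

    # Read each file individually so the source filename (the key) stays
    # available for validation alongside the file's fields.
    for key in json_keys:
        df = spark.read.json(f"s3://{bucket}/{key}")
        # ... validate the fields of this file against `key` here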

The output file name cannot be controlled because multiple executors are responsible for generating the output files. We can control the folder where the output data is written, but not the file names inside it.
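
For illustration, a sketch of writing a DynamicFrame to a chosen output folder; `glueContext`, `output_dyf` (the DynamicFrame produced by the job), and the S3 path are assumed placeholders from a typical Glue job script:

    # We choose the folder; Spark/Glue chooses the part-file names inside it,
    # e.g. run-...-part-r-00000.
    glueContext.write_dynamic_frame.from_options(
        frame=output_dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-output-bucket/processed/"},  # hypothetical folder
        format="json",
    )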

You can use the DynamicFrame repartition method to reduce the number of output partitions/files before you write out your frame. And although Spark cannot name your output file, as mentioned above, the file can still be renamed after it has been written to S3.
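
A sketch of the repartition step, continuing with the assumed `output_dyf` and output path from above:

    # Collapse the DynamicFrame to a single partition so the write produces
    # one part file, which is simpler to rename afterwards.
    single_partition_dyf = output_dyf.repartition(1)

    glueContext.write_dynamic_frame.from_options(
        frame=single_partition_dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-output-bucket/processed/"},
        format="json",
    )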

Please refer to this answer, which uses a Hadoop FileSystem object created from an S3 path, to modify the output filename. You would need to use boto3 to capture the input file name to replace {desired_name} in that answer.
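
Below is a hedged sketch of that approach: after the write, the generated part file is located through a Hadoop FileSystem object obtained via Spark's JVM gateway and renamed in place. `sc` is the job's SparkContext (e.g. sc = spark.sparkContext); the output folder and `desired_name` are placeholders, with `desired_name` built from the input key captured with boto3 as described above:

    # Hypothetical output folder and target name.
    output_dir = "s3://my-output-bucket/processed/"
    desired_name = "rawpart1-processed.json"

    # Get a Hadoop FileSystem for the S3 path through Spark's JVM gateway.
    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create(output_dir),
        sc._jsc.hadoopConfiguration(),
    )

    # Locate the single part file Spark wrote and rename it in place.
    for status in fs.listStatus(hadoop.fs.Path(output_dir)):
        name = status.getPath().getName()
        if name.startswith("run-") or name.startswith("part-"):
            fs.rename(status.getPath(), hadoop.fs.Path(output_dir + desired_name))
            break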
