
Is there a way to read the filename from an S3 bucket when running an AWS Glue ETL job, and to name the output file? Does PySpark provide a way to do it?

Original post: 2020-06-08 23:33:54 | Tags: amazon-web-services, amazon-s3, pyspark, aws-glue, aws-glue-data-catalog

I am running an AWS Glue ETL job that reads JSON files named rawpart1.json and rawpart2.json from multiple S3 buckets. I need to validate the fields from both files along with the filenames from both S3 buckets. Can I get the filenames to read and change? After the ETL job runs, I want to set the filename of the job's output in the S3 bucket. Currently I am getting run-15902070851728-part-r-00000 as the filename. Let me know if we can do this in PySpark. Thanks

2 answers

You cannot control the output filename generated by Spark. But if you need the file name in order to read a specific file, you can use boto3 to get the file name from the S3 bucket and then pass it to your ETL job to read that particular file.
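As a rough illustration of the boto3 step, the sketch below lists the object keys under a prefix so that one of them can be handed to the ETL job. The function name `iter_object_keys`, the `.json` suffix filter, and the bucket and prefix in the usage note are assumptions for illustration, not part of the original answer. The function expects a boto3 S3 client to be passed in.

```python
from typing import Iterator


def iter_object_keys(
    s3_client, bucket: str, prefix: str = "", suffix: str = ".json"
) -> Iterator[str]:
    """Yield keys under `prefix` in `bucket` that end with `suffix`.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # list_objects_v2 returns at most 1000 keys per call, so use the paginator.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                yield obj["Key"]
```

With a real client this might be called as `list(iter_object_keys(boto3.client("s3"), "my-input-bucket", prefix="incoming/"))`, where the bucket and prefix are placeholders. Each returned key can then be joined into an `s3://bucket/key` path for the job to read.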

The output file name cannot be controlled because multiple executors are responsible for generating the output files. We can control the name of the folder where the output data goes, but not the file name.

You can use the DynamicFrame repartition method to reduce the number of output partitions/files before you write out your frame. And although Spark cannot name your output file, as mentioned above, the file can still be renamed after it has been written to S3.
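One possible way to do the rename after the write is with boto3 rather than Spark. S3 has no rename operation, so the sketch below copies the object to the new key and then deletes the original. The function name `rename_s3_object` is an assumption for illustration, and `s3_client` is expected to be a boto3 S3 client. Note that `copy_object` handles objects up to 5 GB, so larger outputs would need a multipart copy instead.

```python
def rename_s3_object(s3_client, bucket: str, old_key: str, new_key: str) -> None:
    """Rename an object within one bucket by copying it, then deleting the original.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    if old_key == new_key:
        # Nothing to do; deleting here would destroy the only copy.
        return
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # Only remove the source once the copy call has returned without raising.
    s3_client.delete_object(Bucket=bucket, Key=old_key)
```

If the frame was reduced to a single partition before writing, the output folder should hold one part file, and that file's key is the `old_key` to pass here.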

Please refer to this answer, which uses a Hadoop FileSystem object created from an S3 path to let you modify the output filename. You would need to use Boto3 to capture the input file name and substitute it for {desired_name} in that answer.
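Once the input key has been captured, the desired output name can be derived from it with plain string handling. The helper below is a minimal sketch of that step. The name `output_key_for`, the `_processed` suffix, and the `output/` prefix are all hypothetical choices for illustration. Nothing in the original answer prescribes a naming scheme.

```python
import posixpath


def output_key_for(input_key: str, output_prefix: str, suffix: str = "_processed") -> str:
    """Build an output key that reuses the input file's base name.

    S3 keys always use "/" as the separator, so posixpath is used instead of
    os.path to keep the behavior identical on every platform.
    """
    base = posixpath.basename(input_key)   # e.g. "rawpart1.json"
    stem, ext = posixpath.splitext(base)   # e.g. ("rawpart1", ".json")
    return posixpath.join(output_prefix, f"{stem}{suffix}{ext}")
```

For example, `output_key_for("incoming/rawpart1.json", "output/")` produces `output/rawpart1_processed.json`, which could then be used as the target name for the rename.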