
How to write to multiple S3 buckets based on distinct values of a dataframe in an AWS Glue job?

I have a dataframe with an account_id column. I want to group all of the distinct account_id rows and write them to different S3 buckets. Writing to a new folder for each account_id within a given S3 bucket would work too.

If you want all rows with the same account_id to end up in one folder, you can achieve this with the partitionBy function. Below is an example that groups the rows by account_id and writes them in Parquet format to separate folders. You can change the write mode depending on your use case.

df.write.mode("overwrite").partitionBy('account_id').parquet('s3://mybucket/')

If you want multiple partition levels, you can add more columns to partitionBy. For example, if you have a date column with values in yyyy/mm/dd format, the snippet below will create date folders inside each account_id folder.

df.write.mode("overwrite").partitionBy('account_id','date').parquet('s3://mybucket/')

This will write files to S3 in the following layout:

s3://mybucket/account_id=somevalue/date=2020/11/01
s3://mybucket/account_id=somevalue/date=2020/11/02
s3://mybucket/account_id=somevalue/date=2020/11/03
......
s3://mybucket/account_id=somevalue/date=2020/11/30
