
How to write to multiple S3 buckets based on distinct values of a dataframe in an AWS Glue job?

I have a dataframe with an account_id column. I want to group all of the distinct account_id rows and write them to different S3 buckets. Writing to a new folder for each account_id within a given S3 bucket would work too.

If you want all rows with the same account_id to end up in one folder, you can achieve this with the partitionBy function. Below is an example that groups the rows by account_id and writes them in Parquet format to separate folders. You can change the write mode depending on your use case.

df.write.mode("overwrite").partitionBy('account_id').parquet('s3://mybucket/')

If you want multiple partition levels, you can add more columns to partitionBy. For example, if you have a date column with values in yyyy/mm/dd format, the snippet below will create date folders inside each account_id folder.

df.write.mode("overwrite").partitionBy('account_id','date').parquet('s3://mybucket/')

This will write files to S3 in the following layout:

s3://mybucket/account_id=somevalue/date=2020/11/01
s3://mybucket/account_id=somevalue/date=2020/11/02
s3://mybucket/account_id=somevalue/date=2020/11/03
......
s3://mybucket/account_id=somevalue/date=2020/11/30
