How to export DynamoDB to S3 as a single file?
I have a DynamoDB table that needs to be exported to an S3 bucket every 24 hours using Data Pipeline. The exported data will in turn be queried by a Spark job.
The problem is that whenever I set up a Data Pipeline to do this, the output in S3 is multiple partitioned files.
Is there a way to ensure that the entire table is exported as a single file in S3? If not, is there a way in Spark to read the partitioned files using the manifest and combine them into one in order to query the data?
You have two options here (the function should be called on the DataFrame just before writing):

repartition(1)

coalesce(1)

But as the docs emphasize, the better choice in your case is repartition:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
Docs: Spark's documentation for coalesce and repartition.
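If you instead end up with an existing multi-file export, the part files are plain line-delimited JSON and can simply be concatenated. A minimal sketch, illustrated with local files and made-up data (against S3 you would first list and download the objects, e.g. with boto3):

```python
import glob
import os
import tempfile

# Simulate a partitioned export directory with two part files (made-up rows).
export_dir = tempfile.mkdtemp()
for i, row in enumerate(['{"id": 1}\n', '{"id": 2}\n']):
    with open(os.path.join(export_dir, f"part-{i:05d}"), "w") as f:
        f.write(row)

# Concatenate every part file, in sorted order, into one combined file.
combined = os.path.join(export_dir, "combined.json")
with open(combined, "w") as out:
    for part in sorted(glob.glob(os.path.join(export_dir, "part-*"))):
        with open(part) as f:
            out.write(f.read())
```

That said, if the goal is only to query the data, Spark can read the whole partitioned directory directly (e.g. `spark.read.json(path)`), so merging is not strictly required.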