
How to export dynamodb to s3 as a single file?

I have a DynamoDB table which needs to be exported to an S3 bucket every 24 hours using Data Pipeline. The export will in turn be queried by a Spark job.

The problem is that whenever I set up a Data Pipeline to do this activity, the output in S3 is multiple partitioned files.

Is there a way to ensure that the entire table is exported as a single file in S3? If not, is there a way in Spark to read the partitioned files using the manifest and combine them into one in order to query the data?
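For the second part of the question, note that Spark can already point at the whole export prefix and load every part file into one DataFrame, so a single physical file is not strictly required for querying. Below is a minimal sketch assuming the export files are JSON lines; the bucket and prefix are hypothetical placeholders, and the reader would need to be adjusted to whatever format the Data Pipeline export actually produces.

```scala
import org.apache.spark.sql.SparkSession

object ReadDynamoExport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-dynamodb-export")
      .getOrCreate()

    // Hypothetical bucket/prefix: point the reader at the whole export
    // directory so Spark picks up every partitioned part file at once.
    // Assumes JSON lines; swap the reader to match the real export format.
    val exportPath = "s3://my-bucket/dynamodb-export/2020-01-01/*"

    val df = spark.read.json(exportPath)

    // Query the combined data as a single logical table.
    df.createOrReplaceTempView("table_export")
    spark.sql("SELECT COUNT(*) FROM table_export").show()

    spark.stop()
  }
}
```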

You have two options here (the function should be called on the DataFrame just before writing):

  1. repartition(1)
  2. coalesce(1)

But as the docs emphasize, the better choice in your case is repartition (see the sketch after the quoted docs):

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
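A minimal sketch of what that looks like just before the write; the output path is a hypothetical placeholder and `df` is assumed to be the DataFrame holding the table data:

```scala
// Collapse the DataFrame to one partition right before writing so S3
// receives a single part file. Bucket and path are placeholders.
df.repartition(1)
  .write
  .mode("overwrite")
  .json("s3://my-bucket/dynamodb-export-single/")

// coalesce(1) would also produce one file, but as the quoted docs note,
// repartition(1) adds a shuffle and keeps the upstream stages parallel.
```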

Docs:

repartition

coalesce

