How to export DynamoDB to S3 as a single file?
I have a DynamoDB table that needs to be exported to an S3 bucket every 24 hours using Data Pipeline. The exported data will in turn be queried by a Spark job.
The problem is that whenever I set up a Data Pipeline to do this, the output in S3 is multiple partitioned files.
Is there a way to ensure that the entire table is exported as a single file in S3? If not, is there a way in Spark to read the partitioned files using the manifest and combine them into one in order to query the data?
You have two options here (the function should be called on the DataFrame just before writing):

repartition(1)

coalesce(1)

But as the docs emphasize, the better choice in your case is repartition:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
Docs: Spark's documentation for coalesce and repartition.
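If you instead end up with an existing multi-file export, the part files are plain line-delimited JSON and can simply be concatenated. A minimal sketch, illustrated with local files and made-up data (against S3 you would first list and download the objects, e.g. with boto3):

```python
import glob
import os
import tempfile

# Simulate a partitioned export directory with two part files (made-up rows).
export_dir = tempfile.mkdtemp()
for i, row in enumerate(['{"id": 1}\n', '{"id": 2}\n']):
    with open(os.path.join(export_dir, f"part-{i:05d}"), "w") as f:
        f.write(row)

# Concatenate every part file, in sorted order, into one combined file.
combined = os.path.join(export_dir, "combined.json")
with open(combined, "w") as out:
    for part in sorted(glob.glob(os.path.join(export_dir, "part-*"))):
        with open(part) as f:
            out.write(f.read())
```

That said, if the goal is only to query the data, Spark can read the whole partitioned directory directly (e.g. `spark.read.json(path)`), so merging is not strictly required.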