What does "partitioned data" mean? - S3

I want to use Netflix's outputCommitter (using Spark with Amazon EMR). The README lists two options:

  1. S3DirectoryOutputCommitter - for writing unpartitioned data to S3 with conflict resolution.
  2. S3PartitionedOutputCommitter - for writing partitioned data to S3 with conflict resolution.

I tried to understand the difference, but without success. Can someone explain what "partitioned data" means in S3?

According to the Hadoop docs, "This committer [is] an extension of the 'Directory' committer which has a special conflict resolution policy designed to support operations which insert new data into a directory tree structured using Hive's partitioning strategy: different levels of the tree represent different columns."

Search the Hadoop docs for the full details.

Be aware that the EMR committers are not the ASF S3A ones, so they take different config options and have their own docs. But since their work is a reimplementation of the Netflix work, they should do the same thing here.
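To make the quoted layout concrete, here is a minimal Python sketch of a Hive-style directory tree, where each level is one `column=value` pair. The helper names (`partition_path`, `touched_partitions`), the `year`/`month` columns, and the file names are hypothetical and chosen only for illustration; this is not code from the Netflix, EMR, or S3A committers. It only shows which partition directories a batch of new files would land in, which is the scope a partition-aware conflict policy has to reason about.

```python
from posixpath import dirname, join


def partition_path(base, partition_values):
    """Build a Hive-style directory: one 'column=value' level per partition column."""
    levels = [f"{col}={val}" for col, val in partition_values]
    return join(base, *levels) + "/"


def touched_partitions(file_keys):
    """Return the set of partition directories that the given output files land in."""
    return {dirname(key) + "/" for key in file_keys}


# Hypothetical output of one job: three files spread over two partitions.
new_files = [
    partition_path("data", [("year", 2023), ("month", 1)]) + "part-0000.parquet",
    partition_path("data", [("year", 2023), ("month", 1)]) + "part-0001.parquet",
    partition_path("data", [("year", 2023), ("month", 2)]) + "part-0000.parquet",
]

print(sorted(touched_partitions(new_files)))
# ['data/year=2023/month=1/', 'data/year=2023/month=2/']
```

Unpartitioned output, by contrast, would put every file directly under a single directory such as `data/`.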

I'm not familiar with outputCommitter, but partitioned data in Amazon S3 normally refers to splitting files amongst directories to reduce the amount of data that needs to be read from disk.

For example:

/data/month=1/
/data/month=2/
/data/month=3/
...

If a Hive-type query is run against the data with a clause like WHERE month=1, it only needs to look in the month=1/ subdirectory. With just the three partitions shown above, that saves about two-thirds of the disk access.
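The pruning idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how Hive or Spark implement it; the helper names (`parse_partitions`, `prune`) and the `part-0000.csv` file names are assumptions made up for the example.

```python
def parse_partitions(key):
    """Extract 'column=value' directory segments from an object key into a dict."""
    parts = {}
    # Skip the last segment, which is the file name rather than a directory level.
    for segment in key.split("/")[:-1]:
        if "=" in segment:
            col, _, val = segment.partition("=")
            parts[col] = val
    return parts


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every requested column."""
    return [
        k for k in keys
        if all(parse_partitions(k).get(col) == str(val) for col, val in wanted.items())
    ]


# Hypothetical listing matching the three directories above.
keys = [
    "data/month=1/part-0000.csv",
    "data/month=2/part-0000.csv",
    "data/month=3/part-0000.csv",
]

selected = prune(keys, month=1)  # plays the role of WHERE month=1
print(selected)  # ['data/month=1/part-0000.csv']
```

Only one of the three files is selected, so the other two never have to be opened.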
