What does "partitioned data" mean in S3?
I want to use Netflix's outputCommitter (using Spark with Amazon EMR). In the README there are 2 options:
I tried to understand the differences, but without success. Can someone explain what "partitioned data" is in S3?
According to the Hadoop docs, "This committer an extension of the “Directory” committer which has a special conflict resolution policy designed to support operations which insert new data into a directory tree structured using Hive's partitioning strategy: different levels of the tree represent different columns."
Search the Hadoop docs for the full details.
Be aware that the EMR committers are not the ASF S3A ones, so they take different config options and have their own docs. But since their work is a reimplementation of the Netflix work, they should do the same thing here.
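To make the quoted description concrete, here is a minimal sketch of how a Hive-style partitioned tree maps column values to directory levels. The function name, bucket name, and column names are hypothetical and chosen only for illustration; this is not code from Hadoop or from the Netflix committer.

```python
def partition_path(root, partition_columns, row):
    """Build the Hive-style directory for one row.

    Each partition column becomes one directory level named column=value,
    in the order the columns are listed.
    """
    segments = [f"{col}={row[col]}" for col in partition_columns]
    return "/".join([root.rstrip("/")] + segments) + "/"


# Hypothetical bucket and columns, for illustration only.
row = {"year": 2020, "month": 1, "value": 42}
path = partition_path("s3://example-bucket/data", ["year", "month"], row)
print(path)
```

Here `year` is the first level of the tree and `month` the second, which is the "different levels of the tree represent different columns" part of the quote. In Spark, `DataFrameWriter.partitionBy` writes output in this kind of layout.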
I'm not familiar with outputCommitter, but partitioned data in Amazon S3 normally refers to splitting files amongst directories to reduce the amount of data that needs to be read from disk.
For example:
/data/month=1/
/data/month=2/
/data/month=3/
...
If a Hive-type query is run against the data with a clause like WHERE month=1, then it would only need to look in the month=1/ subdirectory. With only the three partitions shown above, that saves 2/3rds of the disk access.
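The pruning described above can be sketched in plain Python. This is only an illustration of the idea, not how Hive or Spark implement it; the helper names and the object keys (including the `part-0000.parquet` file names) are made up for the example.

```python
def parse_partitions(key):
    """Extract Hive-style column=value pairs from an object key."""
    parts = {}
    for segment in key.strip("/").split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            parts[column] = value
    return parts


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every wanted column."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(col) == str(val) for col, val in wanted.items())
    ]


# Hypothetical object keys following the layout in the example above.
keys = [
    "data/month=1/part-0000.parquet",
    "data/month=2/part-0000.parquet",
    "data/month=3/part-0000.parquet",
]
selected = prune(keys, month=1)
print(selected)
```

Only the key under `month=1/` is kept, so a reader never has to open the files in the other two directories.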