
Crushing small files in HDFS

We run Spark 0.9.1 on Mesos 0.17 against CDH5. Until now, we have continued using the 'mr1' version of the CDH series so that we could run the filecrush project on our smaller files. For various reasons, we would like to have the freedom to upgrade to MR2.

Do any tools exist for doing this outside of Hadoop's map/reduce? The filecrush library we use today is non-trivial, so translating the pattern to Spark did not seem straightforward.

MR1 code usually works with no changes (or very few) after a recompile against the MR2 libraries. Does that not work? This is probably quite straightforward.

You wouldn't translate this quite directly to Spark, but you can probably achieve a similar effect quite easily by mapping a bunch of files and outputting the result with a different partitioning. You may just run into the same issues, as Spark is going to use HDFS and its InputFormats to read your data into splits, and that is kinda where your problem is coming from to begin with.
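A minimal Scala sketch of that coalesce-and-rewrite idea, assuming line-oriented text input and a Spark 0.9.x driver. The Mesos master URL, HDFS paths, and target file count are hypothetical placeholders; this is the generic pattern, not a port of filecrush:

```scala
import org.apache.spark.SparkContext

// A rough "crush" pass: read a directory of small text files and rewrite
// them as a small number of large files by shrinking the partition count.
object CrushSmallFiles {
  def main(args: Array[String]): Unit = {
    // Hypothetical values -- substitute your own Mesos master and HDFS paths.
    val master    = "mesos://master:5050"
    val inputDir  = "hdfs:///data/small-files"
    val outputDir = "hdfs:///data/crushed"
    val numFiles  = 8  // target number of output part files

    val sc = new SparkContext(master, "crush-small-files")

    // Each small file yields at least one split/partition on read; coalesce
    // merges those partitions without a shuffle, so the write produces only
    // `numFiles` part files in the output directory.
    sc.textFile(inputDir)
      .coalesce(numFiles, shuffle = false)
      .saveAsTextFile(outputDir)

    sc.stop()
  }
}
```

Note this sketch only covers plain text. For SequenceFiles or other container formats you would read and write with the matching Hadoop InputFormat/OutputFormat instead, which is where the caveat above about Spark relying on HDFS InputFormats comes into play.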
