
Hadoop InputSplit for large text-based files

In Hadoop I'd like to split a file (almost) equally among the mappers. The file is large and I want to use a specific number of mappers, which is defined at job start. I've customized the input split, but I want to be sure that if I split the file into two (or more) splits I won't cut a line in half, since each mapper should receive complete lines, not broken ones.

So the question is this: how can I get the approximate size of a file split as it is created? Or, if that is not possible, how can I estimate the number of (almost) equal file splits for a large file, given the constraint that no mapper instance should receive a broken line?
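For context, here is a minimal sketch of one way to aim for a given number of splits using the new MapReduce API (org.apache.hadoop.mapreduce): cap the maximum split size at roughly fileLength / numMappers. The class name, input path, and mapper count are placeholders, not part of the original question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-sizing");

        Path input = new Path(args[0]);              // input file (placeholder)
        int numMappers = Integer.parseInt(args[1]);  // desired mapper count

        // Total file length in bytes, taken from the filesystem.
        long fileLength = FileSystem.get(conf).getFileStatus(input).getLen();

        // Cap each split at roughly fileLength / numMappers so the
        // framework produces about numMappers splits of similar size.
        long targetSplitSize = (fileLength + numMappers - 1) / numMappers;
        FileInputFormat.setMaxInputSplitSize(job, targetSplitSize);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, input);
        // ... set mapper/reducer/output classes and submit as usual
    }
}
```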

Everything that you are asking for is the default behavior in MapReduce: mappers always process complete lines, and by default MapReduce strives to spread the load evenly among the mappers.
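For reference, the default split size in FileInputFormat is computed from the HDFS block size and the configured min/max split sizes; the following snippet mirrors the formula used by FileInputFormat.computeSplitSize:

```java
// How FileInputFormat sizes splits by default:
// splitSize = max(minSize, min(maxSize, blockSize))
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}
```

This is why setting the max split size (as in the sketch above) is enough to pull the split size below the block size and hence raise the mapper count.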

You can get more details about it here; check out the InputSplits paragraph.

Also, the answer linked here by @Shaw talks about exactly how the case of lines spanning block/split boundaries is handled.
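In short, Hadoop's LineRecordReader applies a simple convention at split boundaries. The following is a simplified sketch of that convention (not the actual implementation; the class and method names here are illustrative):

```java
import java.io.IOException;

// Sketch of the convention LineRecordReader uses so that no mapper
// ever sees a broken line:
//  * every split except the first skips its (possibly partial) first
//    line, because the previous split's reader consumes it in full;
//  * every split reads past its nominal end to finish its last line.
abstract class LineBoundarySketch {
    long start, pos, end;  // byte offsets of this split

    // Reads one line starting at offset into out; returns bytes consumed.
    abstract long readLine(long offset, StringBuilder out) throws IOException;

    void initialize(long splitStart, long splitEnd) throws IOException {
        start = splitStart;
        end = splitEnd;
        if (start != 0) {
            // Not the first split: discard the leading partial line.
            start += readLine(start, new StringBuilder());
        }
        pos = start;
    }

    boolean nextLine(StringBuilder line) throws IOException {
        // Emit a line as long as it *starts* within the split;
        // the last line may run past `end` into the next block.
        if (pos <= end) {
            pos += readLine(pos, line);
            return true;
        }
        return false;
    }
}
```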

I think a thorough reading of the Hadoop bible should clear up most of your doubts in this regard.
