
MapReduce basics

Original post: 2017-12-11 20:28:44. Tags: mapreduce, input-split, recordreader

I have a 300 MB text file stored with a block size of 128 MB, so three blocks are created in total: 128 + 128 + 44 MB. Correct me if I'm wrong: for MapReduce, the default input split size is the same as the block size (128 MB here), and it can be configured. The record reader then reads through each split and creates key-value pairs where the key is the byte offset and the value is a single line (TextInputFormat).

My question: if a block ends in the middle of a line, so that the line only finishes in the next block, will the rest of the line be fetched from the other node, or will the remaining part be processed on that other node? Also, how does the second node know that its first line has already been taken for processing and does not need to be processed again?

E.g.: This is stackoverflow.This (end of block 1 / input split) is a map reduce example. (end of line)
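The block arithmetic in the question can be checked with a few lines of Python. This is only a sketch of the division into fixed-size blocks; `block_sizes` is a hypothetical helper name, not part of Hadoop, and sizes are given in whole megabytes for simplicity.

```python
from typing import List


def block_sizes(file_size_mb: int, block_size_mb: int) -> List[int]:
    """Return the sizes (in MB) of the blocks a file would be stored in.

    Hypothetical helper for illustration: every block is full except
    possibly the last one, which holds whatever is left over.
    """
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    # The file from the question: 300 MB with a 128 MB block size.
    print(block_sizes(300, 128))
```

For the 300 MB file this gives two full 128 MB blocks plus one 44 MB block, matching the three blocks described above.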

1 answer

Three mappers will be created for this scenario, one per input split. A line that crosses a block boundary is not cut in half: the record reader for mapper 1 keeps reading past the end of its split until it reaches the end of the line, so it processes the complete line even though part of it is stored in block 2 (those bytes are read from HDFS, over the network if block 2 lives on another node). Mapper 2, in turn, skips that partial first line and starts processing at the next line.
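The rule in this answer (mapper 1 finishes the line that crosses the boundary, mapper 2 skips it) can be illustrated with a small in-memory simulation. This is a simplified sketch modeled on how a line-oriented record reader behaves, not Hadoop's actual code: `read_split` is a hypothetical function, the "file" is a `bytes` object, lines are assumed to end in `\n`, and the 20-byte split size is chosen only to force a boundary in the middle of a line.

```python
from typing import Iterator, List, Tuple


def read_split(data: bytes, start: int, length: int) -> Iterator[Tuple[int, str]]:
    """Yield (byte offset, line) records for one split of `data`.

    Hypothetical simulation, assuming '\n' line terminators:
    - every split except the first discards its first (possibly partial)
      line, because the previous split's reader is responsible for it;
    - a reader keeps emitting lines as long as the line *starts* at or
      before the end of its split, reading past the split end if needed.
    """
    end = start + length
    pos = start
    if start != 0:
        # Skip up to and including the first newline in this split.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode()
        pos = line_end + 1


def split_ranges(total: int, split_size: int) -> List[Tuple[int, int]]:
    """Cut `total` bytes into consecutive (start, length) splits."""
    return [(s, min(split_size, total - s)) for s in range(0, total, split_size)]


DATA = b"first line\nThis is a map reduce example\nlast line\n"

if __name__ == "__main__":
    for start, length in split_ranges(len(DATA), 20):
        print((start, length), list(read_split(DATA, start, length)))
```

With these assumptions, the first split emits the second line in full even though it ends well past byte 20, and the second split starts at the line after it, so every line is emitted exactly once across all splits.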