Hadoop mappers: lines vs files

I am learning Hadoop/MapReduce and have a question about the various possibilities for splitting up work among mappers.

In the standard 'wordcount' scenario, each different process works on a unique line and does some basic math (addition). Is it possible, however, to have each process work on a unique file? For example, if I have 500,000 unique files, each of which is ~5M, can I tell Hadoop that each process should perform some analysis on each file and then perform statistics on the resulting analysis (for example, average the results together)?

For example, suppose each file contains:

{name}
{data1}
{data2}
...
{dataN}

and I want to perform a mathematical function on this file to get F({name}) = [value1, value2, value3] based on {data1, ..., dataN}, and, at the end, I want to find the average of all the [value1, value2, value3] arrays for each {name}. In this case, having Hadoop work on each line will not help, since each data value must be associated with a name, so I would like Hadoop to maintain knowledge of which name it is working with.

If this is possible, would the calculation of F be the 'map' phase and then the averaging of the [value1, value2, value3] arrays be the 'reduce' phase?

So, to consolidate the question into a clear one-liner: how can I get Hadoop to split up work on files, rather than lines?

We can get the file name and output it as the mapper's output key. The mapper's output value can then be the computed values such as value1, value2, value3, etc. The snippet to get the file name is as follows:

// Cast to FileSplit: getInputSplit() returns the generic InputSplit, which has no getPath()
FileSplit split = (FileSplit) context.getInputSplit();
String fileName = split.getPath().getName();
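
A minimal mapper along these lines might look like the sketch below. The class name PerFileMapper is hypothetical, and it assumes each {dataX} line holds a single number; in practice you would replace the parsing with whatever per-line piece of your F() computation is needed.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PerFileMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final Text fileKey = new Text();
    private final DoubleWritable outValue = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The input split tells us which file (and therefore which {name}) this line came from.
        FileSplit split = (FileSplit) context.getInputSplit();
        fileKey.set(split.getPath().getName());

        String value = line.toString().trim();
        if (value.isEmpty()) {
            return;
        }
        // Assumption: every {dataX} line is a single number; skip anything else.
        try {
            outValue.set(Double.parseDouble(value));
        } catch (NumberFormatException e) {
            return; // e.g. the {name} header line
        }
        context.write(fileKey, outValue);
    }
}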

In the reducer we can iterate over the values grouped under each key, which here is the file name, and perform the necessary operations such as average, sum, etc. The reducer output can contain the file name along with the computed value.
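
A matching reducer sketch (again with a hypothetical class name) that averages everything emitted for a given file name could look like this:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PerFileAverageReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable average = new DoubleWritable();

    @Override
    protected void reduce(Text fileName, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        long count = 0;
        // All values that the mappers emitted for this file name arrive together here.
        for (DoubleWritable value : values) {
            sum += value.get();
            count++;
        }
        if (count > 0) {
            average.set(sum / count);
            // One (file name, average) pair per input file.
            context.write(fileName, average);
        }
    }
}

With this pairing, the per-file values are produced on the map side and the averaging asked about in the question happens on the reduce side.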
