
How to divide map-reduce tasks?

Original post: 2017-02-13 12:54:56. Tags: python, hadoop, mapreduce, hadoop-streaming

I have a table with 200 columns. I need about 50 of those columns, which are named in a list, and only the rows from the last 24 months according to the 'timestamp' column.

I'm confused about what belongs in the mapper and what belongs in the reducer.

Since this is just a transformation, will it have only a mapper phase, or does filtering the rows to the last 24 months belong in the reducer? I'm not sure this really uses map-reduce for what it was made for.

I'm using Python with Hadoop Streaming.

1 answer

So, you have a table with 200 columns (say T), a separate list (say L) of what is to be picked from T, and you want only the data from the last 24 months (based on the timestamp in T).

In MapReduce, the mapper receives the entries of T sequentially. Before your mapper gets into map(), i.e. in setup(), put the block of code that reads L and keeps it handy (use a suitable data structure to hold the list). Your code then needs two checks: 1) whether the entry from T matches L; if so, 2) whether its timestamp falls within the last 24 months.
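Since the question uses Python with Hadoop Streaming, where there is no setup() method, the same idea might look roughly like the sketch below: anything that would go in setup() runs once before the loop over stdin. The delimiter, the timestamp format, the column positions, and the function names are all assumptions for illustration, not something stated in the question, and L is assumed to be a list of column positions.

```python
#!/usr/bin/env python
"""Map-only Hadoop Streaming mapper: keep selected columns and recent rows."""
import sys
from datetime import datetime

# Hypothetical assumptions: tab-separated rows, no header line,
# timestamps formatted as "YYYY-MM-DD HH:MM:SS".
DELIMITER = "\t"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_INDEX = 0         # hypothetical position of the 'timestamp' column
WANTED_INDEXES = [0, 3, 7]  # hypothetical positions of the ~50 columns in L


def cutoff_24_months(now):
    """Return the datetime 24 months (two calendar years) before `now`."""
    try:
        return now.replace(year=now.year - 2)
    except ValueError:
        # Feb 29 has no counterpart two years earlier; fall back to Feb 28.
        return now.replace(year=now.year - 2, day=28)


def map_line(line, cutoff):
    """Return the projected row, or None if the row is filtered out."""
    fields = line.rstrip("\n").split(DELIMITER)
    try:
        ts = datetime.strptime(fields[TIMESTAMP_INDEX], TIMESTAMP_FORMAT)
        if ts < cutoff:
            return None  # older than 24 months
        return DELIMITER.join(fields[i] for i in WANTED_INDEXES)
    except (ValueError, IndexError):
        return None  # skip malformed or short rows


def main():
    # Equivalent of setup(): compute shared state once, before reading rows.
    cutoff = cutoff_24_months(datetime.now())
    for line in sys.stdin:
        out = map_line(line, cutoff)
        if out is not None:
            print(out)


if __name__ == "__main__":
    main()
```

Each input row is handled independently of every other row, which is why both the column selection and the date filter fit in the map phase.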

Done. Your output is what you expected. No reducer is required here, at least for this much.
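To actually run the job without a reduce phase, the number of reducers can be set to zero, in which case the mapper output is written directly as the job output. A minimal sketch of how the Hadoop Streaming command line could be assembled from Python follows; the streaming jar location, the HDFS paths, and the script name mapper.py are placeholders that depend on your installation, and the helper function itself is hypothetical.

```python
def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper_script="mapper.py"):
    """Build the argv for a map-only Hadoop Streaming job.

    All arguments are placeholders supplied by the caller; the jar
    location in particular varies between Hadoop installations.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options (-D, -files) must come before streaming options.
        "-D", "mapreduce.job.reduces=0",  # zero reducers: map-only job
        "-files", mapper_script,          # ship the script to the task nodes
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python " + mapper_script,
    ]
```

The resulting list can be passed to subprocess.check_call, or joined with spaces and run from a shell.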

Happy MapReducing.