
Combining results from hadoop map-reduce

I have a Mapper<AvroKey<Email>, NullWritable, Text, Text> which takes in an Email and, for each address it contains, emits the address as the key and the field it was found in (from, to, cc, etc.) as the value.
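The mapper's emission logic can be sketched in plain Java, independent of the Hadoop and Avro types; the field accessors and the EmailFieldEmitter class name here are assumptions, since the question doesn't show the Avro schema:

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class EmailFieldEmitter {
        // Emits one (address, field) pair per occurrence, mirroring what the
        // Mapper<AvroKey<Email>, NullWritable, Text, Text> would write out.
        static List<Map.Entry<String, String>> emit(String from, List<String> to, List<String> cc) {
            List<Map.Entry<String, String>> pairs = new ArrayList<>();
            pairs.add(new SimpleEntry<>(from, "FROM"));
            for (String addr : to) pairs.add(new SimpleEntry<>(addr, "TO"));
            for (String addr : cc) pairs.add(new SimpleEntry<>(addr, "CC"));
            return pairs;
        }

        public static void main(String[] args) {
            List<Map.Entry<String, String>> pairs =
                emit("a@x.com", List.of("b@x.com", "c@x.com"), List.of("b@x.com"));
            pairs.forEach(p -> System.out.println(p.getKey() + "\t" + p.getValue()));
        }
    }

Note that the same address can be emitted multiple times with different field tags, which is exactly what the reducer later counts.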

Then I have a Reducer<Text, Text, NullWritable, Text> that takes in the email address and field names. It emits a NullWritable key and a count of how many times the address is present in each field, e.g.:

{
  "address": "joe.bloggs@gmail.com",
  "toCount": 12,
  "fromCount": 4
}

I'm using FileUtil.copyMerge to conflate the output from the jobs but (obviously) the results from different reducers aren't merged, so in practice I see:

{
  "address": "joe.bloggs@gmail.com",
  "toCount": 12,
  "fromCount": 0
}, {
  "address": "joe.bloggs@gmail.com",
  "toCount": 0,
  "fromCount": 4
}
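Whatever mechanism combines these partial records, the merge itself is just per-field addition keyed by address. A minimal sketch of that merge in plain Java (the CountMerger name and the array-based record layout are illustration only; the field order follows the JSON above):

    import java.util.HashMap;
    import java.util.Map;

    public class CountMerger {
        // Sums {address, toCount, fromCount} partial records by address.
        static Map<String, int[]> merge(Object[][] partials) {
            Map<String, int[]> totals = new HashMap<>();
            for (Object[] p : partials) {
                String address = (String) p[0];
                int[] counts = totals.computeIfAbsent(address, a -> new int[2]);
                counts[0] += (int) p[1]; // toCount
                counts[1] += (int) p[2]; // fromCount
            }
            return totals;
        }

        public static void main(String[] args) {
            Map<String, int[]> totals = merge(new Object[][] {
                {"joe.bloggs@gmail.com", 12, 0},
                {"joe.bloggs@gmail.com", 0, 4},
            });
            int[] joe = totals.get("joe.bloggs@gmail.com");
            System.out.println(joe[0] + " " + joe[1]); // prints "12 4"
        }
    }

In MapReduce terms this is what a second aggregation pass over the copied-together output would have to do, whereas routing all values for one address to one reduce call avoids the second pass entirely.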

Is there a more sensible way of approaching this problem so I can get a single result per email address? (I gather a combiner running in the pre-reduce phase is only run on a subset of the data and isn't guaranteed to give the results I want.)

Edit:

Reducer code would be something like:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.fasterxml.jackson.databind.ObjectMapper; // Jackson, assuming the fasterxml artifact

public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {

    private static final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Map<String, Object>> results = new HashMap<>();

        for (Text value : values) {
            if (!results.containsKey(value.toString())) {
                Map<String, Object> result = new HashMap<>();
                result.put("address", key.toString());
                result.put("to", 0);
                result.put("from", 0);

                results.put(value.toString(), result);
            }

            Map<String, Object> result = results.get(value.toString());

            switch (value.toString()) {
            case "TO":
                result.put("to", ((int) result.get("to")) + 1);
                break;
            case "FROM":
                result.put("from", ((int) result.get("from")) + 1);
                break;
            }
        }

        // Plain loop rather than forEach: context.write throws checked
        // exceptions, which a Consumer lambda cannot propagate.
        for (Map<String, Object> result : results.values()) {
            context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
        }
    }
}

Each input key of the reducer corresponds to a unique email address, so you don't need the results collection. Each time the reduce method is called, it is for a distinct email address, so my suggestion is:

public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {

  private static final ObjectMapper mapper = new ObjectMapper();

  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {

    Map<String, Object> result = new HashMap<>();
    result.put("address", key.toString());
    result.put("to", 0);
    result.put("from", 0);

    for (Text value : values) {
        switch (value.toString()) {
        case "TO":
            result.put("to", ((int) result.get("to")) + 1);
            break;
        case "FROM":
            result.put("from", ((int) result.get("from")) + 1);
            break;
        }
    }

    // One record per reduce call, i.e., one per email address
    context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
  }
}

I am not sure what the ObjectMapper class does, but I suppose that you need it to format the output. Otherwise, I would print the input key as the output key (i.e., the email address) and two concatenated counts for the "from" and "to" fields of each email address.

If your input is a static data collection (i.e., not a stream or something similar), then you should get each email address only once. If your input arrives as a stream and you need to incrementally build your final output, then the output of one job can be the input of another. If such is the case, I suggest using MultipleInputs, in which one Mapper is the one that you described earlier and another, an identity mapper, forwards the output of the previous job to the Reducer. This way, again, the same email address is handled by the same reduce task.
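A driver for that chained job might look like the sketch below. EmailMapper stands for the mapper described in the question (the name is hypothetical), the input formats are assumptions about how each input is stored, and note that in the new mapreduce API the base Mapper class already acts as an identity mapper:

    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class EmailDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(EmailDriver.class);

            // New email data goes through the original extracting mapper...
            MultipleInputs.addInputPath(job, new Path(args[0]),
                    AvroKeyInputFormat.class, EmailMapper.class);
            // ...while the previous job's output is forwarded unchanged
            // (the base Mapper class is an identity mapper in the new API).
            MultipleInputs.addInputPath(job, new Path(args[1]),
                    KeyValueTextInputFormat.class, Mapper.class);

            job.setReducerClass(EmailReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path(args[2]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With this setup the reducer would see two kinds of values for an address ("TO"/"FROM" tags from new data and already-aggregated records forwarded from the previous run), so it would need to distinguish the two when summing.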
