
Combining results from Hadoop map-reduce

I have a Mapper<AvroKey<Email>, NullWritable, Text, Text> which takes in an Email and, for each address it contains, emits a key of the email address and a value naming the field it was found in (from, to, cc, etc.).
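
For reference, a minimal sketch of what that mapper could look like (the Email accessors getFrom(), getTo() and getCc(), and the emitted field-name strings, are placeholders here, not the real Avro schema):

import java.io.IOException;

import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: the Email accessors below are assumed, not taken from the actual schema.
public class EmailFieldMapper extends Mapper<AvroKey<Email>, NullWritable, Text, Text> {

    @Override
    protected void map(AvroKey<Email> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        Email email = key.datum();

        // Emit one (address, field-name) pair per address occurrence.
        if (email.getFrom() != null) {
            context.write(new Text(email.getFrom().toString()), new Text("FROM"));
        }
        if (email.getTo() != null) {
            for (CharSequence to : email.getTo()) {
                context.write(new Text(to.toString()), new Text("TO"));
            }
        }
        if (email.getCc() != null) {
            for (CharSequence cc : email.getCc()) {
                context.write(new Text(cc.toString()), new Text("CC"));
            }
        }
    }
}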

Then I have a Reducer<Text, Text, NullWritable, Text> that takes in the email address and the field names. It emits a NullWritable key and a value containing a count of how many times the address appears in each field, e.g.:

{
  "address": "joe.bloggs@gmail.com",
  "toCount": 12,
  "fromCount": 4
}

I'm using FileUtil.copyMerge to combine the output files from the job, but (obviously) the results from different reducers aren't merged, so in practice I see:

{
  "address": "joe.bloggs@gmail.com",
  "toCount": 12,
  "fromCount": 0
}, {
  "address": "joe.bloggs@gmail.com",
  "toCount": 0,
  "fromCount": 4
}

Is there a more sensible way of approaching this problem so I can get a single result per email address? (I gather a combiner running in the pre-reduce phase is only run on a subset of the data, so it isn't guaranteed to give the results I want.)

Edit:

Reducer code would be something like:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.fasterxml.jackson.databind.ObjectMapper;

public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {

    private static final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // One entry per distinct field name seen for this address.
        Map<String, Map<String, Object>> results = new HashMap<>();

        for (Text value : values) {
            if (!results.containsKey(value.toString())) {
                Map<String, Object> result = new HashMap<>();
                result.put("address", key.toString());
                result.put("to", 0);
                result.put("from", 0);

                results.put(value.toString(), result);
            }

            Map<String, Object> result = results.get(value.toString());

            switch (value.toString()) {
            case "TO":
                result.put("to", ((int) result.get("to")) + 1);
                break;
            case "FROM":
                result.put("from", ((int) result.get("from")) + 1);
                break;
            }
        }

        // Write one JSON record per entry; writeValueAsString throws a checked
        // exception, so a plain loop is used rather than forEach.
        for (Map<String, Object> result : results.values()) {
            context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
        }
    }
}

Each input key of the reducer corresponds to a unique email address, so you don't need the results collection. Each time the reduce method is called, it is for a distinct email address, so my suggestion is:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.fasterxml.jackson.databind.ObjectMapper;

public class EmailReducer extends Reducer<Text, Text, NullWritable, Text> {

  private static final ObjectMapper mapper = new ObjectMapper();

  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {

    // A single result per reduce call, i.e. per email address.
    Map<String, Object> result = new HashMap<>();
    result.put("address", key.toString());
    result.put("to", 0);
    result.put("from", 0);

    for (Text value : values) {
        switch (value.toString()) {
        case "TO":
            result.put("to", ((int) result.get("to")) + 1);
            break;
        case "FROM":
            result.put("from", ((int) result.get("from")) + 1);
            break;
        }
    }

    context.write(NullWritable.get(), new Text(mapper.writeValueAsString(result)));
  }
}

I am not sure what the ObjectMapper class does, but I suppose you need it to format the output. Otherwise, I would print the input key as the output key (i.e., the email address) and the two counts for the "from" and "to" fields concatenated as the output value.
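
For example, something like the following sketch (the class name EmailCountReducer and the tab-separated format are just placeholders; the reducer's output types change to Text, Text):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical variant: emit the address as the output key and the two counts,
// tab-separated, as the output value (no JSON / ObjectMapper involved).
public class EmailCountReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int to = 0;
        int from = 0;

        for (Text value : values) {
            if ("TO".equals(value.toString())) {
                to++;
            } else if ("FROM".equals(value.toString())) {
                from++;
            }
        }

        // e.g. "joe.bloggs@gmail.com    12    4"
        context.write(key, new Text(to + "\t" + from));
    }
}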

If your input is a static data collection (i.e., not a stream or something similar), then you should get each email address only once. If your input arrives as a stream and you need to build the final output incrementally, then the output of one job can be the input of another. In that case I suggest using MultipleInputs, where one Mapper is the one you described earlier and the other, an identity-style mapper, forwards the output of the previous job to the Reducer, as sketched below. This way, again, all records for the same email address are handled by the same reduce task.
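
A rough sketch of that two-input job setup. Everything here is an assumption: EmailFieldMapper is a placeholder for the mapper from the question, and PreviousOutputMapper is a hypothetical mapper that would re-parse the previous job's output lines back into (address, field/count) pairs for the reducer:

import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmailCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "email field counts");
        job.setJarByClass(EmailCountDriver.class);

        // New Avro email data goes through the mapper described in the question
        // (EmailFieldMapper is a placeholder name for it).
        MultipleInputs.addInputPath(job, new Path(args[0]),
                AvroKeyInputFormat.class, EmailFieldMapper.class);
        // Assumes Email is an Avro-generated class with a static schema accessor.
        AvroJob.setInputKeySchema(job, Email.getClassSchema());

        // The previous job's text output is forwarded by a small identity-style
        // mapper (PreviousOutputMapper is hypothetical: it re-keys each line by
        // email address so the reducer sees all records for one address together).
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, PreviousOutputMapper.class);

        job.setReducerClass(EmailReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}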
