
MapReduce on MongoDB collection is turning up empty

I've been trying to bring many large sets of data down into one collection, but I'm having issues writing a MapReduce function to get there.

This is what my data looks like (here are 17 rows, in reality I have 4+ million):

{"user": 1, "day": 1, "type": "a", "sum": 10}
{"user": 1, "day": 2, "type": "a", "sum": 32}
{"user": 1, "day": 1, "type": "b", "sum": 11}
{"user": 2, "day": 4, "type": "b", "sum": 2}
{"user": 1, "day": 2, "type": "b", "sum": 1}
{"user": 1, "day": 3, "type": "b", "sum": 9}
{"user": 1, "day": 4, "type": "b", "sum": 12}
{"user": 2, "day": 2, "type": "a", "sum": 3}
{"user": 3, "day": 2, "type": "b", "sum": 81}
{"user": 1, "day": 4, "type": "a", "sum": 22}
{"user": 1, "day": 5, "type": "a", "sum": 39}
{"user": 2, "day": 5, "type": "a", "sum": 8}
{"user": 2, "day": 3, "type": "b", "sum": 1}
{"user": 3, "day": 3, "type": "b", "sum": 99}
{"user": 2, "day": 3, "type": "a", "sum": 5}
{"user": 1, "day": 3, "type": "a", "sum": 41}
{"user": 3, "day": 4, "type": "b", "sum": 106}
...  

I'm trying to get it to look like this in the end (an array for each type, where the contents are just the sums in the appropriate index decided by the day, if that day doesn't exist for that type, it's just 0):

{"user": 1, "type_a_sums": [10, 32, 41, 22, 39], "type_b_sums": [11, 1, 9, 12, 0]}
{"user": 2, "type_a_sums": [0, 3, 5, 0, 8], "type_b_sums": [0, 0, 1, 2, 0]}
{"user": 3, "type_a_sums": [0, 0, 0, 0, 0], "type_b_sums": [0, 81, 99, 106, 0]}
...

This is the MapReduce I have been trying:

var mapsum = function(){
    var output = {user: this.user, type_a_sums: [0, 0, 0, 0, 0], type_b_sums: [0, 0, 0, 0, 0], tempType: this.type, tempSum: this.sum, tempDay: this.day}

    if(this.type == "a") {
        output.type_a_sums[this.day-1] = this.sum;
    }

    if(this.type == "b") {
        output.type_b_sums[this.day-1] = this.sum;
    }

    emit(this.user, output);
};

var r = function(key, values) {
    var outs = {user: 0, type_a_sums: [0, 0, 0, 0, 0], type_b_sums: [0, 0, 0, 0, 0], tempType: -1, tempSum: -1, tempDay: -1}

    values.forEach(function(v){

        outs.user = v.user;

        if(v.tempType == "a") {
            outs.type_a_sums[v.tempDay-1] = v.tempSum;
        }

        if(v.tempType == "b") {
            outs.type_b_sums[v.tempDay-1] = v.tempSum;
        }

    });

    return outs;
};


res = db.sums.mapReduce(mapsum, r, {out: 'joined_sums'})

This gives me my output on the small sample, but when I run it over all 4 million I get a ton of outputs that look like this:

{"user": 1, "type_a_sums": [0, 0, 0, 0, 0], "type_b_sums": [0, 0, 0, 0, 0]}
{"user": 2, "type_a_sums": [0, 3, 5, 0, 8], "type_b_sums": [0, 0, 1, 2, 0]}
{"user": 3, "type_a_sums": [0, 0, 0, 0, 0], "type_b_sums": [0, 0, 0, 0, 0]}

Here, a large portion of users that should have sums in their arrays instead come back filled with the 0's from the dummy arrays in the reduce function's outs object, as if the real values were never merged in.

What's really weird is this: if I run the exact same functions on the same collection but restrict the query to a single user, res = db.sums.mapReduce(mapsum, r, {query: {user: 1}, out: 'joined_sums'}), for a user I know should have sums in their arrays but who has previously been coming back as all 0's, I actually get the output I wanted for just that user. Run it again over all 4 million documents and I'm back to 0's everywhere. It's as if all the work is being overwritten with the dummy filler arrays.

Do I have too much data? Shouldn't it be able to slog through it given the time? Or am I hitting some barrier I don't know about?

Thank you for including lots of detail. There are a few issues here.

Let's start from the top.

I'm trying to get it to look like this in the end

{"user": 2, "type_a_sums": [0, 3, 5, 0, 8], "type_b_sums": [0, 0, 1, 2, 0]}

It will actually look like this:

{ _id: { "user": 2 }, value: { "type_a_sums": [0, 3, 5, 0, 8], "type_b_sums": [0, 0, 1, 2, 0] } }

Note that _id is kind of like your "group by" and value is kind of like your "sum" columns.

So problem #1 is that you are emitting user as your key, but it's also part of your value. That isn't necessary: reduce only combines values that share the same key, so you don't need this line either: outs.user = v.user;

You also have problem #2: your reduce is incorrect.

I think it has to do with reduce() being called more than once per key.

reduce() is designed to be called multiple times per key. That is how it scales across servers: one server can call reduce a couple of times, and those partial results can then be merged and sent to another server to be reduced again.

Here's a different way to look at it. Reduce takes an array of value objects and reduces them to a single value object.

There are some corollaries here:

  • If I do reduce([a, b]), it should be the same as reduce([b, a]).
  • If I do reduce([a, reduce([b, c])]), it should be the same as reduce([reduce([a, b]), c]).

So it should not matter what order I run them in or how many times the value gets reduced, it's always the same output.
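These two properties can be checked directly. Here's a minimal sketch (mergeSums is a hypothetical helper, not part of the original code) showing that element-wise addition of fixed-length arrays satisfies both, which is exactly the kind of merge this data needs:

```javascript
// A merge that satisfies both properties: element-wise addition of
// fixed-length arrays. Addition is commutative and associative, so the
// order and grouping of reduce calls cannot change the result.
function mergeSums(values) {
    var acc = [0, 0, 0, 0, 0];
    values.forEach(function (v) {
        for (var i = 0; i < acc.length; i++) acc[i] += v[i];
    });
    return acc;
}

var a = [0, 0, 1, 0, 0], b = [0, 2, 0, 0, 0], c = [4, 0, 0, 0, 0];

// reduce([a, reduce([b, c])]) === reduce([reduce([a, b]), c])
var left  = mergeSums([a, mergeSums([b, c])]);
var right = mergeSums([mergeSums([a, b]), c]);
// Both groupings produce the same merged array.
```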

If you look at your code, this is not what's happening. Just take a look at the type_a_sums. What happens if I get the following two values coming into reduce?

reduce([ [0,0,1,0,0], [0,2,0,0,0] ]) => ???

To me, this looks like the output should be [0,2,1,0,0]. If this is true, then you don't need all of those tempX fields. Instead, you need to focus on emitting the correct arrays and then merging those arrays correctly.
