MongoDB MapReduce is missing documents

Question

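I have a strange situation with map reduce. The job runs, but the result does not take all records into account.

I have a collection of tweets that look like the one below. I have 230 documents; my query is on createdyear. Here is a sample: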

{
    "_id" : ObjectId("56e55b52330dfb156547d559"),
    "message" : "RT @TwitFAKTA: Kiper MU, David De Gea mempunyai ritual unik sebelum bertanding, yaitu memutar lagu-lagu Metallica dengan keras.",
    "createdyear" : "2016",
    "handle" : "xxx",
    "createdtime" : "13:23:33",
    "searchtopic" : "Metallica",
    "createdmonth" : "03",
    "createddate" : "2016-03-13",
    "user" : "xxx"
}

My map function looks like this. It is pretty simple: the end result should be a count of tweets per topic and per month.

function(){ 
    emit({topic: this.searchtopic, month: this.createdmonth},1) 
};

Here is the reduce function: I am just counting the number of values for the given key.

function(key,value) {
    var counter=0; 
    for (var i=0;i<value.length;i++) { 
        counter = counter +1; 
    }
    return counter; 
};

Then I run the map reduce and store the output in a collection.

db.tweets.mapReduce(map,reduce,{out: "mapreduce_test"})

The result looks like this:

{
    "result" : "mapreduce_test",
    "timeMillis" : 6,
    "counts" : {
        "input" : 230,
        "emit" : 230,
        "reduce" : 4,
        "output" : 2
    },
    "ok" : 1
}

The map reduce runs, but the result is not correct. When I list the output of the mapreduce, I get the following:

{ "_id" : { "topic" : "3 Doors Down", "month" : "03" }, "value" : 2 }
{ "_id" : { "topic" : "Metallica", "month" : "03" }, "value" : 31 }

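When I search the documents manually, I get 228 for Metallica and 2 for 3 Doors Down. Together these are the 230 records that were input and emitted.

So where are the other documents? What happened?

In general, I have a process that fetches tweets from Twitter and stores them in MongoDB, so the collection keeps growing. When I run the mapreduce task regularly via cron, I notice that it works for a while and then suddenly returns wrong results. Have a look: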

Sun Mar 13 14:30:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 47.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

Sun Mar 13 14:40:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 67.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

Sun Mar 13 14:50:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 87.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

Sun Mar 13 15:00:02 CET 2016
running mapreduce for topic: Metallica
{"name": "Metallica","data":[0, 0, 7.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
running mapreduce for topic: 3 Doors Down
{"name": "3 Doors Down","data":[0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
writing output file: /home/uwe/development/highcharts/highcharts_tweets.html

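The document count grows and then suddenly drops at 15:00, even though my documents are still in the database. I checked this several times.

I have also run this on a second machine, with the same result.

Does anybody have an explanation for this behaviour?

Thanks,

Uwe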

Answer 1

Because MongoDB reduces in batches, you cannot just add 1 in the reduce; you actually need to sum value[i]:

function(key,value) {
    var counter=0; 
    for (var i=0;i<value.length;i++) { 
        counter = counter + value[i]; 
    }
    return counter; 
};

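Suppose the batch size is 100. MongoDB passes 100 values in the first batch (which total 100), and when it runs the next batch it passes 101 values: one value holding the sum so far (100), plus the new values.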
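The batching described above can be imitated in plain JavaScript. This is only a sketch: the helper `reduceInBatches`, the names `countEntries` and `sumValues`, and the fixed batch size of 100 are assumptions made for illustration, and MongoDB's real batching is not this regular. It feeds 228 emitted values of 1 (the Metallica tweets from the question) through both versions of the reduce function.

```javascript
// Reduce from the question: counts array entries and ignores their values.
function countEntries(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) {
        counter = counter + 1;
    }
    return counter;
}

// Corrected reduce: adds up the values, so a partial sum keeps its weight.
function sumValues(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) {
        counter = counter + values[i];
    }
    return counter;
}

// Hypothetical helper: hands the emitted values to reduce in batches and
// puts the previous partial result at the front of the next batch.
function reduceInBatches(reduce, key, emitted, batchSize) {
    var partial = null;
    for (var start = 0; start < emitted.length; start += batchSize) {
        var batch = emitted.slice(start, start + batchSize);
        if (partial !== null) {
            batch.unshift(partial);
        }
        partial = reduce(key, batch);
    }
    return partial;
}

var key = { topic: "Metallica", month: "03" };
var emitted = new Array(228).fill(1); // one emit of 1 per tweet

var buggyTotal = reduceInBatches(countEntries, key, emitted, 100);
var fixedTotal = reduceInBatches(sumValues, key, emitted, 100);
console.log("buggy:", buggyTotal, "fixed:", fixedTotal);
```

With these assumed batches the counting version ends at 29 and the summing version at 228. The question's 31 comes from MongoDB's actual batch boundaries, which differ from this sketch, but the mechanism is the same.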

When you add 1 instead of value[i], the previous batch's sum is always counted as 1.
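One way to catch this kind of mistake before running a job is to check that a reduce function gives the same answer when it is applied to its own output, since MongoDB may call reduce more than once for the same key. The sketch below does that outside MongoDB; `survivesReReduce`, `countEntries` and `sumValues` are hypothetical names used only for this example.

```javascript
// Reduce that counts array entries, as in the question.
function countEntries(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) {
        counter = counter + 1;
    }
    return counter;
}

// Reduce that sums the values, as in this answer.
function sumValues(key, values) {
    var counter = 0;
    for (var i = 0; i < values.length; i++) {
        counter = counter + values[i];
    }
    return counter;
}

// Hypothetical check: reducing an already reduced result must not change it,
// i.e. reduce(key, [reduce(key, values)]) equals reduce(key, values).
function survivesReReduce(reduce, key, values) {
    var once = reduce(key, values);
    var twice = reduce(key, [once]);
    return once === twice;
}

var key = { topic: "Metallica", month: "03" };
var values = [1, 1, 1, 1, 1];

console.log(survivesReReduce(countEntries, key, values)); // false: 5, then 1
console.log(survivesReReduce(sumValues, key, values));    // true: 5, then 5
```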
