
Mongodb mapreduce optimization

I have a collection of hits stored in MongoDB with this schema: { userid: ..., date: ... }

I want to display a report that computes the number of unique visitors between two dates (visitors with different userids who made a hit between those dates).

Example of output:

Number of visitors: ...
Number of hits: ...

The collection's size is about 1M records.

My first idea is to use incremental mapreduce to compute aggregated values by day, and then run a second mapreduce over the days to output the final result.

The problem is that when I select a range of dates on the report, I'm not able to compute the correct number of unique visitors.

Example of aggregated values by day:
Day 1: 1 unique visitor
Day 2: 2 unique visitors (1 of the 2 visitors also made a hit on day 1)

The sum of unique visitors over the two days is 3, but over the whole period there are only 2 unique visitors, not 3.
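To make the overcount concrete, here is a small sketch in plain JavaScript (no MongoDB needed). The sample data and the names `hitsByDay`, `sumOfDailyUniques` and `uniquesOverPeriod` are invented for illustration; they mirror the two-day example.

```javascript
// Hypothetical sample data: user "a" visits on both days, user "b" only on day 2.
const hitsByDay = {
  day1: ["a"],
  day2: ["a", "b"],
};

// Adding the per-day unique counts counts user "a" twice.
const sumOfDailyUniques = Object.values(hitsByDay)
  .map((userids) => new Set(userids).size)
  .reduce((total, n) => total + n, 0);

// Deduplicating across the whole period counts each userid once.
const uniquesOverPeriod = new Set(Object.values(hitsByDay).flat()).size;

console.log(sumOfDailyUniques); // 3
console.log(uniquesOverPeriod); // 2
```

A per-day count discards which userids were seen, so it cannot be merged correctly across days; the userids themselves have to be available when the range is evaluated.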

Is there a performant way to compute unique visitors in this example?

This problem might be easier to solve by using a single map-reduce over the desired dates. Instead of first aggregating the unique users for a single day (your first step), you could do this same aggregation over all of the dates you wish to check. In this way you can avoid the second step entirely.

To break this down into the Map and Reduce sections:

Map: Find all of the userids that were recorded during the desired time range

Reduce: Remove all duplicated userids

Once this process is complete you should be left with the set of unique visitors (more specifically, unique userids) for that time range.
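The Map and Reduce steps can be sketched as follows. `mapHit` and `reduceHits` are written in the shape MongoDB's `mapReduce` expects (the document is `this`, and the map function calls `emit`). Everything else is an assumption for illustration: the `emit` stand-in, the `runLocally` driver and the sample hits exist only so the sketch runs without a database.

```javascript
// Stand-in for the emit() that MongoDB provides to map functions.
let emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Map: emit one entry per hit, keyed by userid.
function mapHit() {
  emit(this.userid, 1);
}

// Reduce: collapse all entries for one userid into a single hit count.
function reduceHits(userid, counts) {
  return counts.reduce((total, n) => total + n, 0);
}

// Hypothetical local driver: filters by date, groups emitted values by key,
// and applies reduce once per key. (MongoDB itself may skip reduce for keys
// that have only a single value; the result is the same here.)
function runLocally(hits, d1, d2) {
  emitted = [];
  hits
    .filter((h) => h.date >= d1 && h.date <= d2)
    .forEach((h) => mapHit.call(h));
  const grouped = new Map();
  for (const [key, value] of emitted) {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  }
  const result = {};
  for (const [key, values] of grouped) result[key] = reduceHits(key, values);
  return result;
}

// Invented sample hits: "c" falls outside the requested range.
const sampleHits = [
  { userid: "a", date: new Date("2012-09-01") },
  { userid: "a", date: new Date("2012-09-02") },
  { userid: "b", date: new Date("2012-09-02") },
  { userid: "c", date: new Date("2012-09-05") },
];

const perUser = runLocally(sampleHits, new Date("2012-09-01"), new Date("2012-09-02"));
console.log(Object.keys(perUser).length); // 2 unique visitors
console.log(perUser); // { a: 2, b: 1 }
```

Against a real collection, the same two functions would be passed as `db.collection.mapReduce(mapHit, reduceHits, { query: { date: { $gte: d1, $lte: d2 } }, out: { inline: 1 } })`. The number of result documents is then the number of unique visitors, and the sum of their values is the number of hits.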

Alternatively, there is an even easier way to do this that does not require map-reduce at all. The "distinct" command (see the MongoDB distinct documentation) allows you to select a field and return an array containing only the distinct (unique) values for that field. If you use the distinct command on the documents within the desired time range, you will get an array that contains all the userids from that period without any duplicates.
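As a sketch of the distinct approach: the helper name `countUniqueVisitors` and the stub collection below are assumptions for illustration. The helper relies only on `collection.distinct(field, query)`, which is the call the mongo shell exposes; with the Node.js driver the same call returns a Promise and would need `await`.

```javascript
// Counts unique userids among hits whose date lies in [d1, d2].
// `collection` is anything exposing distinct(field, query),
// e.g. a collection object in the mongo shell.
function countUniqueVisitors(collection, d1, d2) {
  const userids = collection.distinct("userid", {
    date: { $gte: d1, $lte: d2 },
  });
  return userids.length;
}

// Hypothetical in-memory stub so the sketch runs without a database.
// It only understands the { date: { $gte, $lte } } query used above.
const stubCollection = {
  docs: [
    { userid: "a", date: new Date("2012-09-01") },
    { userid: "a", date: new Date("2012-09-02") },
    { userid: "b", date: new Date("2012-09-02") },
    { userid: "c", date: new Date("2012-09-05") },
  ],
  distinct(field, query) {
    const { $gte, $lte } = query.date;
    const matching = this.docs.filter((d) => d.date >= $gte && d.date <= $lte);
    return [...new Set(matching.map((d) => d[field]))];
  },
};

const stubVisitors = countUniqueVisitors(
  stubCollection,
  new Date("2012-09-01"),
  new Date("2012-09-02")
);
console.log(stubVisitors); // 2
```

This yields only the visitor count; the number of hits for the report would need a separate count over the same date range. Because distinct returns all values in a single result, a period with a very large number of userids may run into MongoDB's maximum document size.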

Hope this helps!

You can do this easily with MongoDB version 2.2 and its aggregation framework.

Assuming the schema {userid: " ", date: " "} and given two specific dates d1 and d2, this is the pipeline:

db.collection.aggregate(
[
    {
        "$match" : {
            "date" : {
                "$gte" : d1,
                "$lte" : d2
            }
        }
    },
    {
        "$group" : {
            "_id" : "$userid",
            "hits" : {
                "$sum" : 1
            }
        }
    },
    {
        "$group" : {
            "_id" : "1",
            "visitors" : {
                "$sum" : 1
            },
            "hits" : {
                "$sum" : "$hits"
            }
        }
    },
    {
        "$project" : {
            "_id" : 0,
            "visitors" : 1,
            "hits" : 1
        }
    }
]
)
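
For reuse across reports, the same pipeline can be wrapped in a small builder. This is only a sketch: the function name `buildVisitorsPipeline` and the collection name `hits` in the usage comment are assumptions, not something the answer specifies.

```javascript
// Builds the four-stage visitors/hits pipeline for an inclusive date range.
function buildVisitorsPipeline(d1, d2) {
  return [
    // Keep only hits inside the requested range.
    { $match: { date: { $gte: d1, $lte: d2 } } },
    // One document per userid, carrying that user's hit count.
    { $group: { _id: "$userid", hits: { $sum: 1 } } },
    // Collapse to a single document: count the users, add up their hits.
    { $group: { _id: "1", visitors: { $sum: 1 }, hits: { $sum: "$hits" } } },
    // Drop the constant _id from the output.
    { $project: { _id: 0, visitors: 1, hits: 1 } },
  ];
}

const pipeline = buildVisitorsPipeline(new Date("2012-09-01"), new Date("2012-09-30"));
console.log(pipeline.length); // 4
// In the mongo shell (collection name assumed): db.hits.aggregate(pipeline)
```

An index on `date` should help the initial `$match` stage avoid scanning the whole collection.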
