python mongodb $match and $group

Question

I want to write a simple query that gives me the user with the most followers who has the time zone "Brasilia" and has tweeted 100 or more times.

This is my pipeline:

pipeline = [{'$match':{"user.statuses_count":{"$gt":99},"user.time_zone":"Brasilia"}},
            {"$group":{"_id": "$user.followers_count","count" :{"$sum":1}}},
            {"$sort":{"count":-1}} ]

I adapted it from a practice problem.

This was given as an example of the document structure:
    {
    "_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
    "text" : "First week of school is over :P",
    "in_reply_to_status_id" : null,
    "retweet_count" : null,
    "contributors" : null,
    "created_at" : "Thu Sep 02 18:11:25 +0000 2010",
    "geo" : null,
    "source" : "web",
    "coordinates" : null,
    "in_reply_to_screen_name" : null,
    "truncated" : false,
    "entities" : {
        "user_mentions" : [ ],
        "urls" : [ ],
        "hashtags" : [ ]
    },
    "retweeted" : false,
    "place" : null,
    "user" : {
        "friends_count" : 145,
        "profile_sidebar_fill_color" : "E5507E",
        "location" : "Ireland :)",
        "verified" : false,
        "follow_request_sent" : null,
        "favourites_count" : 1,
        "profile_sidebar_border_color" : "CC3366",
        "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
        "geo_enabled" : false,
        "created_at" : "Sun May 03 19:51:04 +0000 2009",
        "description" : "",
        "time_zone" : null,
        "url" : null,
        "screen_name" : "Catherinemull",
        "notifications" : null,
        "profile_background_color" : "FF6699",
        "listed_count" : 77,
        "lang" : "en",
        "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
        "statuses_count" : 2475,
        "following" : null,
        "profile_text_color" : "362720",
        "protected" : false,
        "show_all_inline_media" : false,
        "profile_background_tile" : true,
        "name" : "Catherine Mullane",
        "contributors_enabled" : false,
        "profile_link_color" : "B40B43",
        "followers_count" : 169,
        "id" : 37486277,
        "profile_use_background_image" : true,
        "utc_offset" : null
    },
    "favorited" : false,
    "in_reply_to_user_id" : null,
    "id" : NumberLong("22819398300")
}

Can anybody spot my mistakes?

Answer 1

Suppose you have a couple of sample documents that form a minimal test case. Insert them into a collection in the mongo shell:

db.collection.insert([
{
    "_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
    "user" : {
        "friends_count" : 145,
        "statuses_count" : 457,
        "screen_name" : "Catherinemull",
        "time_zone" : "Brasilia",
        "followers_count" : 169,
        "id" : 37486277
    },
    "id" : NumberLong(22819398300)
},
{
    "_id" : ObjectId("52fd2490bac3fa1975477702"),
    "user" : {
        "friends_count" : 145,
        "statuses_count" : 12334,
        "time_zone" : "Brasilia",
        "screen_name" : "marble",
        "followers_count" : 2597,
        "id" : 37486278
    },
    "id" : NumberLong(22819398301)
}])
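As for what goes wrong in the question's pipeline: grouping on "$user.followers_count" with a {"$sum": 1} counter produces one document per distinct follower count, and the sort then orders those by how many tweets share that count, not by the follower count itself. Here is a rough plain-Python sketch of what those three stages compute on the two sample documents above (no MongoDB needed; the list comprehension and Counter are only stand-ins for the real stages):

```python
from collections import Counter

# The two sample documents from above, trimmed to the fields the pipeline reads.
tweets = [
    {"user": {"screen_name": "Catherinemull", "time_zone": "Brasilia",
              "statuses_count": 457, "followers_count": 169}},
    {"user": {"screen_name": "marble", "time_zone": "Brasilia",
              "statuses_count": 12334, "followers_count": 2597}},
]

# Stand-in for $match: statuses_count > 99 and time_zone == "Brasilia"
matched = [t for t in tweets
           if t["user"]["statuses_count"] > 99
           and t["user"]["time_zone"] == "Brasilia"]

# Stand-in for $group with _id "$user.followers_count" and count {"$sum": 1}:
# one output document per distinct follower count
counts = Counter(t["user"]["followers_count"] for t in matched)

# Stand-in for $sort on count, descending
grouped = sorted(
    ({"_id": followers, "count": n} for followers, n in counts.items()),
    key=lambda d: d["count"],
    reverse=True,
)

print(grouped)
```

Every entry here ends up with a count of 1, and none of them carries a screen_name, so the result cannot tell you which user has the most followers.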

To get the user with the most followers who is in the time zone "Brasilia" and has tweeted 100 or more times, this pipeline achieves the desired result without using the $group operator:

pipeline = [
    {
        "$match": {
            "user.statuses_count": {
                "$gt":99 
            }, 
            "user.time_zone": "Brasilia"
        }
    },
    {
        "$project": {                
            "followers": "$user.followers_count",
            "screen_name": "$user.screen_name",
            "tweets": "$user.statuses_count"
        }
    },
    {
        "$sort": { 
            "followers": -1 
        }
    },
    {"$limit" : 1}
]

PyMongo output:

{u'ok': 1.0,
 u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
              u'followers': 2597,
              u'screen_name': u'marble',
              u'tweets': 12334}]}
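If it helps to see what each of those four stages does, here is a minimal plain-Python sketch of the same logic. The function name top_user and its parameters are made up for illustration; this is not how MongoDB executes the pipeline, only an equivalent computation on a list of dicts:

```python
def top_user(tweets, time_zone="Brasilia", min_tweets=100):
    """Plain-Python stand-in for the $match/$project/$sort/$limit pipeline."""
    # $match: keep tweets whose user is in the time zone and has enough tweets
    matched = [t["user"] for t in tweets
               if t["user"].get("time_zone") == time_zone
               and t["user"]["statuses_count"] >= min_tweets]

    # $project: keep only the three fields of interest, under new names
    projected = [{"followers": u["followers_count"],
                  "screen_name": u["screen_name"],
                  "tweets": u["statuses_count"]} for u in matched]

    # $sort by followers descending, then $limit 1
    projected.sort(key=lambda d: d["followers"], reverse=True)
    return projected[:1]


# The two sample documents from above, trimmed to the fields that are read.
sample = [
    {"user": {"screen_name": "Catherinemull", "time_zone": "Brasilia",
              "statuses_count": 457, "followers_count": 169}},
    {"user": {"screen_name": "marble", "time_zone": "Brasilia",
              "statuses_count": 12334, "followers_count": 2597}},
]

print(top_user(sample))
```

For integer counts, "$gt": 99 and "$gte": 100 select the same documents, which is why the sketch uses >= min_tweets.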

The following aggregation pipeline will also give you the desired result. Its first stage is the $match operator, which keeps only those documents where the user's time_zone field is "Brasilia" and the tweet count (statuses_count) is greater than or equal to 100, tested with the $gte comparison operator.

The second pipeline stage is the $group operator, which groups the filtered documents by the identifier expression $user.id and applies the accumulator $max to the $user.followers_count field of each group to get the greatest number of followers recorded for each user. The system variable $$ROOT, which references the root (top-level) document currently being processed in the $group stage, is collected into an extra array field, data, for use later on. This is done with the $addToSet accumulator.

The next pipeline stage, $unwind, outputs one document for each element of the data array for processing in the next step.

The following stage, $project, then reshapes each document in the stream, adding new fields whose values come from the previous stage.

The last two pipeline stages, $sort and $limit, reorder the document stream by the sort key followers in descending order and return a single document: the user with the highest number of followers.
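To make the $group and $unwind steps described above a bit more concrete, here is a rough plain-Python sketch of what they compute. The helper name group_then_unwind and the three tiny documents are invented for illustration (user 1 has two tweets with different follower counts, which is the case where grouping by user id matters):

```python
from collections import defaultdict


def group_then_unwind(tweets):
    """Plain-Python stand-in for $group (with $max and $addToSet) then $unwind."""
    groups = defaultdict(lambda: {"max_followers": None, "data": []})

    for t in tweets:
        g = groups[t["user"]["id"]]            # _id: "$user.id"
        followers = t["user"]["followers_count"]
        # $max: keep the largest follower count seen for this user
        if g["max_followers"] is None or followers > g["max_followers"]:
            g["max_followers"] = followers
        # $addToSet "$$ROOT": collect the whole document, skipping duplicates
        if t not in g["data"]:
            g["data"].append(t)

    # $unwind "$data": one output row per collected document
    return [{"_id": uid, "max_followers": g["max_followers"], "data": doc}
            for uid, g in groups.items()
            for doc in g["data"]]


# Made-up documents, not taken from the real data set.
sample = [
    {"_id": "a", "user": {"id": 1, "followers_count": 10}},
    {"_id": "b", "user": {"id": 1, "followers_count": 15}},
    {"_id": "c", "user": {"id": 2, "followers_count": 40}},
]

for row in group_then_unwind(sample):
    print(row["_id"], row["max_followers"], row["data"]["_id"])
```

Both rows for user 1 carry the same max_followers value, so after sorting and limiting you still get the right user even when that user's follower count changed between tweets.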

Your final aggregation pipeline should thus look like this:

db.collection.aggregate([
    {
        '$match': { 
            "user.statuses_count": { "$gte": 100 },
            "user.time_zone": "Brasilia"
        }
    },
    {
        "$group": {
            "_id": "$user.id",
            "max_followers": { "$max": "$user.followers_count" },
            "data": { "$addToSet": "$$ROOT" }
        }
    },
    {
        "$unwind": "$data"
    },   
    {
        "$project": {
            "_id": "$data._id",
            "followers": "$max_followers",
            "screen_name": "$data.user.screen_name",
            "tweets": "$data.user.statuses_count"
        }
    }, 
    {
        "$sort": { "followers": -1 }
    },
    {
        "$limit" : 1
    }
])

Executing this in Robomongo gives the following result:

/* 0 */
{
    "result" : [ 
        {
            "_id" : ObjectId("52fd2490bac3fa1975477702"),
            "followers" : 2597,
            "screen_name" : "marble",
            "tweets" : 12334
        }
    ],
    "ok" : 1
}

In Python, the implementation is essentially the same:

>>> pipeline = [
...     {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
...     {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
...     {"$unwind": "$data"},
...     {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}},
...     {"$sort": { "followers": -1 }},
...     {"$limit" : 1}
... ]
>>>
>>> for doc in collection.aggregate(pipeline):
...     print(doc)
...
{u'tweets': 12334.0, u'_id': ObjectId('52fd2490bac3fa1975477702'), u'followers': 2597.0, u'screen_name': u'marble'}
>>>

where

pipeline = [
    {"$match": {"user.statuses_count": {"$gte":100 }, "user.time_zone": "Brasilia"}},
    {"$group": {"_id": "$user.id","max_followers": { "$max": "$user.followers_count" },"data": { "$addToSet": "$$ROOT" }}},
    {"$unwind": "$data"},   
    {"$project": {"_id": "$data._id","followers": "$max_followers","screen_name": "$data.user.screen_name","tweets": "$data.user.statuses_count"}}, 
    {"$sort": { "followers": -1 }},
    {"$limit" : 1}
]
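If you want to reuse this for other time zones or thresholds, one option is to wrap the pipeline in a small builder function. This is only a sketch: build_pipeline and its parameter names are my own, and the stages are the same ones shown above with the two literals swapped for arguments.

```python
def build_pipeline(time_zone, min_tweets):
    """Return the aggregation pipeline for the top user in a time zone.

    Hypothetical helper: the stages mirror the pipeline shown above,
    with the time zone and tweet threshold passed in as arguments.
    """
    return [
        # Keep tweets from users in the time zone with enough statuses
        {"$match": {"user.statuses_count": {"$gte": min_tweets},
                    "user.time_zone": time_zone}},
        # One group per user, remembering the largest follower count
        {"$group": {"_id": "$user.id",
                    "max_followers": {"$max": "$user.followers_count"},
                    "data": {"$addToSet": "$$ROOT"}}},
        # Back to one document per original tweet
        {"$unwind": "$data"},
        # Flatten to the fields of interest
        {"$project": {"_id": "$data._id",
                      "followers": "$max_followers",
                      "screen_name": "$data.user.screen_name",
                      "tweets": "$data.user.statuses_count"}},
        # Highest follower count first, keep one document
        {"$sort": {"followers": -1}},
        {"$limit": 1},
    ]


pipeline = build_pipeline("Brasilia", 100)
print([next(iter(stage)) for stage in pipeline])
```

You would then pass the result to collection.aggregate(pipeline) exactly as in the session above, assuming collection is your PyMongo collection object.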

solution1
3 ACCPTED 2015-04-29 11:51:27