How to improve the performance of pymongo queries

Question

I inherited an old Mongo database. Let's focus on the following two collections (I removed most of their content for better readability):

Collection user:

db.user.find_one({"email": "user@host.com"})

{'lastUpdate': datetime.datetime(2016, 9, 2, 11, 40, 13, 160000),
 'creationTime': datetime.datetime(2016, 6, 23, 7, 19, 10, 6000),
 '_id': ObjectId('576b8d6ee4b0a37270b742c7'),
 'email': 'user@host.com' }

Collection entry (one user to many entries):

db.entry.find_one({"userId": _id})

{'date_entered': datetime.datetime(2015, 2, 7, 0, 0),
 'creationTime': datetime.datetime(2015, 2, 8, 14, 41, 50, 701000),
 'lastUpdate': datetime.datetime(2015, 2, 9, 3, 28, 2, 115000),
 '_id': ObjectId('54d775aee4b035e584287a42'),
 'userId': '576b8d6ee4b0a37270b742c7', 
 'data': 'test'}

As you can see, there is no DBRef between the two.

What I would like to do is count the total number of entries, and the number of entries updated after a given date.

To do this I used Python's pymongo library. The code below gets me what I need, but it is painfully slow.

import time
from datetime import datetime

from pymongo import MongoClient
client = MongoClient('mongodb://foobar/')
db = client.userdata

# First I need to fetch all user ids. Otherwise db cursor will time out after some time.
user_ids = []  # build a list of tuples (email, id)
for user in db.user.find():
    user_ids.append( (user['email'], str(user['_id'])) )

date = datetime(2016, 1, 1)
for user_id in user_ids:
    email, _id =  user_id

    t0 = time.time()

    query = {"userId": _id}
    no_of_all_entries = db.entry.find(query).count()

    query = {"userId": _id, "lastUpdate": {"$gte": date}}
    no_of_entries_this_year = db.entry.find(query).count()

    t1 = time.time()
    print("delay ", round(t1 - t0, 2))

    print(email, no_of_all_entries, no_of_entries_this_year)

It takes around 0.83 seconds to run both db.entry.find queries on my laptop, and 0.54 seconds on an AWS server (not the MongoDB server).

With ~20000 users, it takes a painful 3 hours to get all the data. Is that the kind of latency you'd expect to see in Mongo? What can I do to improve this? Bear in mind that MongoDB is fairly new to me.
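As a rough sanity check on these figures: the loop runs once per user, so the total time is approximately the user count multiplied by the per-user delay. A minimal sketch using the numbers quoted above (the variable names are illustrative only):

```python
# Back-of-the-envelope estimate from the figures quoted in the question.
users = 20000             # approximate number of users
seconds_per_user = 0.54   # time for both count queries, measured on the AWS server

total_seconds = users * seconds_per_user
total_hours = total_seconds / 3600

# Rounded to one decimal place, this matches the "3 hours" reported above.
print(round(total_hours, 1))
```

The cost grows linearly with the number of users because every user triggers two separate round trips to the database.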

Answer 1

Instead of running two count queries for each user separately, you can get both counts for all users in a single pass with db.collection.aggregate().

And instead of a list of (email, userId) tuples, we build a dictionary keyed by user id, as that makes it easier to look up the corresponding email.

user_emails = {str(user['_id']): user['email'] for user in db.user.find()}

date = datetime(2016, 1, 1)
entry_counts = db.entry.aggregate([
    {"$group": {
        "_id": "$userId",
        "count": {"$sum": 1},
        "count_this_year": {
            "$sum": {
                "$cond": [{"$gte": ["$lastUpdate", date]}, 1, 0]
            }
        }
    }}
])

for entry in entry_counts:
    print(user_emails.get(entry['_id']),
          entry['count'],
          entry['count_this_year'])
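To make it clearer what the $group stage above computes, here is a rough pure-Python equivalent that runs without a database. It is only an illustration: the function name count_entries and the sample documents are made up, and it assumes every entry has both a userId and a lastUpdate field.

```python
from collections import defaultdict
from datetime import datetime


def count_entries(entries, since):
    """Per userId, count all entries and those with lastUpdate >= since."""
    counts = defaultdict(lambda: {"count": 0, "count_this_year": 0})
    for entry in entries:
        bucket = counts[entry["userId"]]
        bucket["count"] += 1                 # like {"$sum": 1}
        if entry["lastUpdate"] >= since:     # like the "$cond" on "$gte"
            bucket["count_this_year"] += 1
    return dict(counts)


# Hypothetical sample data, shaped like the entry documents in the question.
sample_entries = [
    {"userId": "a", "lastUpdate": datetime(2015, 2, 9)},
    {"userId": "a", "lastUpdate": datetime(2016, 3, 1)},
    {"userId": "b", "lastUpdate": datetime(2016, 1, 1)},
]

result = count_entries(sample_entries, datetime(2016, 1, 1))
for user_id, c in result.items():
    print(user_id, c["count"], c["count_this_year"])
```

The aggregation pipeline does the same bookkeeping on the server, so only one small result document per user travels over the network.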

I'm pretty sure the user's email address could be pulled into the result within the aggregation itself, but I'm not a Mongo expert either.
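One possible way to do that is sketched below. It is untested against the asker's data and rests on several assumptions: the server is MongoDB 4.0 or later (needed for $toObjectId), every userId is a valid 24-character hex string, and the helper name build_entry_count_pipeline is made up for this example. Because entry.userId is a string while user._id is an ObjectId, the id is converted before the $lookup join.

```python
from datetime import datetime


def build_entry_count_pipeline(since):
    """Build a pipeline that counts entries per user and joins in the email."""
    return [
        # Same grouping as in the answer above.
        {"$group": {
            "_id": "$userId",
            "count": {"$sum": 1},
            "count_this_year": {
                "$sum": {"$cond": [{"$gte": ["$lastUpdate", since]}, 1, 0]}
            },
        }},
        # userId is stored as a string; convert it so it can match user._id.
        {"$addFields": {"user_oid": {"$toObjectId": "$_id"}}},
        # Join against the user collection.
        {"$lookup": {
            "from": "user",
            "localField": "user_oid",
            "foreignField": "_id",
            "as": "user",
        }},
        # One output row per matched user; rows without a match are dropped.
        {"$unwind": "$user"},
        {"$project": {
            "_id": 0,
            "email": "$user.email",
            "count": 1,
            "count_this_year": 1,
        }},
    ]


pipeline = build_entry_count_pipeline(datetime(2016, 1, 1))

# Usage, assuming `db` is the pymongo database handle from the question:
# for row in db.entry.aggregate(pipeline):
#     print(row["email"], row["count"], row["count_this_year"])
```

With this variant the separate user_emails dictionary would no longer be needed, at the cost of requiring a newer server version.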
