How to improve performance of pymongo queries

I inherited an old Mongo database. Let's focus on the following two collections (most of their content removed for readability):

Collection user:

db.user.find_one({"email": "user@host.com"})

{'lastUpdate': datetime.datetime(2016, 9, 2, 11, 40, 13, 160000),
 'creationTime': datetime.datetime(2016, 6, 23, 7, 19, 10, 6000),
 '_id': ObjectId('576b8d6ee4b0a37270b742c7'),
 'email': 'user@host.com' }

Collection entry (one user to many entries):

db.entry.find_one({"userId": _id})

{'date_entered': datetime.datetime(2015, 2, 7, 0, 0),
 'creationTime': datetime.datetime(2015, 2, 8, 14, 41, 50, 701000),
 'lastUpdate': datetime.datetime(2015, 2, 9, 3, 28, 2, 115000),
 '_id': ObjectId('54d775aee4b035e584287a42'),
 'userId': '576b8d6ee4b0a37270b742c7', 
 'data': 'test'}

As you can see, there is no DBRef between the two: entry.userId holds the user's _id as a plain string.

What I would like to do is count, for each user, the total number of entries and the number of entries updated after a given date.

To do this I used Python's pymongo library. The code below gets me what I need, but it is painfully slow.

import time
from datetime import datetime

from pymongo import MongoClient

client = MongoClient('mongodb://foobar/')
db = client.userdata

# First I need to fetch all user ids. Otherwise db cursor will time out after some time.
user_ids = []  # build a list of tuples (email, id)
for user in db.user.find():
    user_ids.append( (user['email'], str(user['_id'])) )

date = datetime(2016, 1, 1)
for user_id in user_ids:
    email, _id =  user_id

    t0 = time.time()

    query = {"userId": _id}
    no_of_all_entries = db.entry.find(query).count()

    query = {"userId": _id, "lastUpdate": {"$gte": date}}
    no_of_entries_this_year = db.entry.find(query).count()

    t1 = time.time()
    print("delay ", round(t1 - t0, 2))

    print(email, no_of_all_entries, no_of_entries_this_year)

It takes around 0.83 seconds to run both db.entry.find queries for one user on my laptop, and 0.54 seconds on an AWS server (not the MongoDB server).

With ~20000 users, it takes a painful 3 hours to get all the data. Is that the kind of latency you'd expect to see in Mongo? What can I do to improve this? Bear in mind that MongoDB is fairly new to me.
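As a sanity check on those figures, the total run time is roughly the per-user delay multiplied by the number of users, since the loop makes its round trips one user at a time. A back-of-the-envelope sketch (the function name is made up for illustration):

```python
def estimated_hours(users, seconds_per_user):
    """Total wall-clock time in hours for one sequential iteration per user."""
    return users * seconds_per_user / 3600


# Figures quoted above: ~20000 users, 0.54 s (AWS) or 0.83 s (laptop) per user.
aws_hours = estimated_hours(20000, 0.54)
laptop_hours = estimated_hours(20000, 0.83)
print(round(aws_hours, 1), round(laptop_hours, 1))
```

The AWS figure works out to about 3 hours, which matches what I observed, so the cost really is dominated by the 2 x 20000 individual queries.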

Instead of running two count queries for every user separately, you can get both counts for all users in a single call to db.collection.aggregate().

And instead of a list of (email, userId) tuples, we build a dictionary keyed by user id, which makes it easier to look up the corresponding email.

user_emails = {str(user['_id']): user['email'] for user in db.user.find()}

date = datetime(2016, 1, 1)
entry_counts = db.entry.aggregate([
    {"$group": {
        "_id": "$userId",
        "count": {"$sum": 1},
        "count_this_year": {
            "$sum": {
                "$cond": [{"$gte": ["$lastUpdate", date]}, 1, 0]
            }
        }
    }}
])

for entry in entry_counts:
    print(user_emails.get(entry['_id']),
          entry['count'],
          entry['count_this_year'])
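To make it clearer what the $group stage computes, here is a rough pure-Python equivalent that runs without a database. It is only an illustration: the sample documents are made up, the helper name is hypothetical, and it assumes every entry has a lastUpdate field.

```python
from collections import defaultdict
from datetime import datetime


def count_entries(entries, cutoff):
    """Per userId, count all entries and those with lastUpdate >= cutoff."""
    counts = defaultdict(lambda: {"count": 0, "count_this_year": 0})
    for entry in entries:
        bucket = counts[entry["userId"]]   # "_id": "$userId"
        bucket["count"] += 1               # "count": {"$sum": 1}
        if entry["lastUpdate"] >= cutoff:  # "$cond": [{"$gte": [...]}, 1, 0]
            bucket["count_this_year"] += 1
    return dict(counts)


# Made-up sample documents, shaped like the entry collection above.
sample_entries = [
    {"userId": "a", "lastUpdate": datetime(2015, 2, 9)},
    {"userId": "a", "lastUpdate": datetime(2016, 3, 1)},
    {"userId": "b", "lastUpdate": datetime(2016, 1, 1)},
]
result = count_entries(sample_entries, datetime(2016, 1, 1))
print(result)
```

The difference with the aggregation is that MongoDB does this pass on the server in one request, instead of the client issuing two queries per user.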

I'm pretty sure the user's email address could be fetched as part of the aggregation result itself, but I'm not a Mongo expert either.
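One possible way to do that, as an untested sketch: append a $lookup stage that joins the grouped counts against the user collection. Because entry.userId is a string while user._id is an ObjectId, the string has to be converted first; this sketch assumes a server that supports $toObjectId (MongoDB 4.0 or later) and that every userId is a valid 24-character hex string, otherwise the conversion raises an error. The function name and the userObjectId field are made up for illustration.

```python
from datetime import datetime


def build_counts_with_email_pipeline(cutoff):
    """Build an aggregation pipeline: per-user counts, then join the user's email."""
    return [
        # Same grouping as above: total entries and entries updated since cutoff.
        {"$group": {
            "_id": "$userId",
            "count": {"$sum": 1},
            "count_this_year": {
                "$sum": {"$cond": [{"$gte": ["$lastUpdate", cutoff]}, 1, 0]}
            },
        }},
        # userId is stored as a string; convert it so it can match user._id.
        {"$addFields": {"userObjectId": {"$toObjectId": "$_id"}}},
        # Join against the user collection; matches land in the "user" array.
        {"$lookup": {
            "from": "user",
            "localField": "userObjectId",
            "foreignField": "_id",
            "as": "user",
        }},
        # Keep the counts and pull the email out of the first matched user.
        {"$project": {
            "_id": 0,
            "userId": "$_id",
            "email": {"$arrayElemAt": ["$user.email", 0]},
            "count": 1,
            "count_this_year": 1,
        }},
    ]


pipeline = build_counts_with_email_pipeline(datetime(2016, 1, 1))
```

The result would then be consumed with something like `for row in db.entry.aggregate(pipeline): print(row.get("email"), row["count"], row["count_this_year"])`, which removes the need for the separate user_emails dictionary. Users without a matching document simply come back with no email field.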
