Speeding up iteration over a large QuerySet in Django

I'm trying my hand at the Riot API challenge, and I'm trying to use Django as a backend hosted on PythonAnywhere.com.

I have set up a database with a structure similar to the one below:

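from django.db import models

class MatchDetails(models.Model):
    # Data fields
    pass

class Participant(models.Model):
    # on_delete is required from Django 2.0 onward; CASCADE was the old implicit default
    match = models.ForeignKey(MatchDetails, on_delete=models.CASCADE)
    # Data fields

class Timeline(models.Model):
    participant = models.ForeignKey(Participant, on_delete=models.CASCADE)
    # Data fields

# More fields, most with MatchDetails as foreign key.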

I have written a function which retrieves and stores data, and I now have almost 40,000 games stored, with 10 participants in each game. My goal is to extract some statistics from this data, and I basically do something like:

allMatches = MatchDetails.objects.all()
for m in allMatches:
    participants = m.participant_set.all()
    for p in participants:
        # Increment some values
# save the result to the database

Currently it takes a little more than 2 hours.

2015-04-11 03:47:35 -- Completed task, took 7942.00 seconds, return code was 0.

This is a ridiculous amount of time, isn't it? Is there some way for me to speed it up?

I've tried to use iterator(), and I also tried iterating over .values_list() and .all().values(), but I am unable to get objects related through a foreign key this way.

How do I speed up iteration of a large dataset in Django

Is there any way for me to access my foreign key objects when using values_list()? Or is there anything else I can do to speed it up? Any pointers would be appreciated.

Thanks for reading!

knbk's answer is great. You could also do your counting in the database. For instance, if you had a field on the participant model for time spent playing and you wanted the average time that participants spent playing, you could use something like:

from django.db.models import Avg
Participant.objects.aggregate(Avg('time_spent_playing'))

Have a look at the Django aggregation docs for more info.
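To make the idea of counting in the database concrete, here is a minimal sketch that uses Python's built-in sqlite3 module instead of the Django ORM, so it runs without a project. The table layout, the sample values, and the helper names are assumptions made for illustration; the AVG query is roughly the kind of SQL an aggregate(Avg(...)) call asks the database to run.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical participant table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE participant ("
        "id INTEGER PRIMARY KEY, match_id INTEGER, time_spent_playing INTEGER)"
    )
    # 3 made-up matches with 10 participants each.
    rows = [(match_id, 1200 + 60 * slot)
            for match_id in range(1, 4) for slot in range(10)]
    conn.executemany(
        "INSERT INTO participant (match_id, time_spent_playing) VALUES (?, ?)",
        rows,
    )
    return conn


def average_in_python(conn):
    # Every row is transferred to Python and averaged there.
    times = [row[0] for row in
             conn.execute("SELECT time_spent_playing FROM participant")]
    return sum(times) / len(times)


def average_in_database(conn):
    # The database does the arithmetic and sends back a single row.
    (avg,) = conn.execute(
        "SELECT AVG(time_spent_playing) FROM participant"
    ).fetchone()
    return avg


conn = build_demo_db()
print(average_in_python(conn), average_in_database(conn))
```

Both functions give the same number. The second one moves one row across the database connection instead of one row per participant, which is where the saving comes from on a table with around 400,000 rows.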

The best optimization at this point is to use prefetch_related():

allMatches = MatchDetails.objects.prefetch_related('participant_set')
for m in allMatches:
    for p in m.participant_set.all():
        # Increment some values
# save the result to the database

This reduces your number of queries from about 40,000 to 2.
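As a rough illustration of where the two queries come from, the sketch below imitates the pattern with Python's built-in sqlite3 module: one query for the matches, one query for all their participants, and the grouping done in Python. The schema, the kills column, and the helper names are assumptions for the demo, and this is not Django's actual implementation.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema standing in for the MatchDetails and Participant models.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matchdetails (id INTEGER PRIMARY KEY);
    CREATE TABLE participant (
        id INTEGER PRIMARY KEY,
        match_id INTEGER REFERENCES matchdetails(id),
        kills INTEGER  -- hypothetical data field
    );
""")
conn.executemany("INSERT INTO matchdetails (id) VALUES (?)",
                 [(m,) for m in range(1, 51)])
conn.executemany(
    "INSERT INTO participant (match_id, kills) VALUES (?, ?)",
    [(m, k) for m in range(1, 51) for k in range(10)],
)

stats = {"queries": 0}


def run(sql, params=()):
    """Execute one statement, count it, and return all rows."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def total_kills_one_query_per_match():
    # Pattern from the question: one extra query for every match.
    total = 0
    for (match_id,) in run("SELECT id FROM matchdetails"):
        for (kills,) in run(
            "SELECT kills FROM participant WHERE match_id = ?", (match_id,)
        ):
            total += kills
    return total


def total_kills_prefetched():
    # Prefetch pattern: fetch all related rows at once, group them in Python.
    match_ids = [row[0] for row in run("SELECT id FROM matchdetails")]
    placeholders = ",".join("?" * len(match_ids))
    by_match = defaultdict(list)
    for match_id, kills in run(
        f"SELECT match_id, kills FROM participant "
        f"WHERE match_id IN ({placeholders})",
        match_ids,
    ):
        by_match[match_id].append(kills)
    return sum(sum(by_match[m]) for m in match_ids)


for func in (total_kills_one_query_per_match, total_kills_prefetched):
    stats["queries"] = 0
    print(func.__name__, func(), "queries:", stats["queries"])
```

With 50 demo matches the first function issues 51 statements and the second issues 2; the gap grows linearly with the number of matches.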

You can try to speed it up using prefetch_related().

Also, in your values_list() call, you can fetch just the attributes you need from related objects by traversing the relation with a double underscore, like "foreign_key_relation_name__attribute".
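For example, a call such as Participant.objects.values_list('match__queue', 'champion') would follow the match foreign key with a JOIN and return plain tuples. The queue and champion fields are hypothetical, since the question does not show its data fields. The sketch below reproduces that kind of query with Python's built-in sqlite3 module and tallies the tuples without building any model instances.

```python
import sqlite3
from collections import Counter

# Hypothetical tables and columns; the real models' data fields are not shown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matchdetails (id INTEGER PRIMARY KEY, queue TEXT);
    CREATE TABLE participant (
        id INTEGER PRIMARY KEY,
        match_id INTEGER REFERENCES matchdetails(id),
        champion TEXT
    );
    INSERT INTO matchdetails (id, queue) VALUES (1, 'ranked'), (2, 'normal');
    INSERT INTO participant (match_id, champion) VALUES
        (1, 'Ahri'), (1, 'Zed'), (2, 'Ahri'), (2, 'Ahri');
""")

# One query returns (match attribute, participant attribute) tuples.
rows = conn.execute("""
    SELECT m.queue, p.champion
    FROM participant AS p
    JOIN matchdetails AS m ON m.id = p.match_id
""").fetchall()

# Tally picks per (queue, champion) pair directly from the tuples.
picks = Counter(rows)
print(picks)
```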

Using iterator() is also a good way of improving speed if you only iterate through the queryset once.

What does your "save the result to the database" code look like? If you save all items in a batch with update() instead of saving them one by one, you will improve your speed as well.

Depending on two factors (what it is you're updating, and whether you're running Django 1.8), you don't have to iterate through everything.

from django.db.models import F
m.participant_set.update(some_field=F('some_field')*10)

This would set some_field on every participant of the match m to ten times its current value, which would be orders of magnitude faster than iterating over all the rows and doing an update per row.

It is worth keeping in mind that update() does not call Participant.save(), so an overridden save() method is skipped, and the save signals won't be sent either.
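The difference between a per-row save and a single update can be sketched with Python's built-in sqlite3 module. The table layout and helper names are assumptions for the demo; the single UPDATE statement is roughly what an update() with an F() expression sends to the database.

```python
import sqlite3


def make_db():
    """In-memory table standing in for the Participant model (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE participant ("
        "id INTEGER PRIMARY KEY, match_id INTEGER, some_field INTEGER)"
    )
    conn.executemany(
        "INSERT INTO participant (match_id, some_field) VALUES (?, ?)",
        [(1, n) for n in range(1, 11)],
    )
    return conn


def update_row_by_row(conn, match_id):
    # Read every row, compute in Python, write each row back separately.
    rows = conn.execute(
        "SELECT id, some_field FROM participant WHERE match_id = ?",
        (match_id,),
    ).fetchall()
    for row_id, value in rows:
        conn.execute(
            "UPDATE participant SET some_field = ? WHERE id = ?",
            (value * 10, row_id),
        )
    return len(rows) + 1  # number of statements issued


def update_in_one_statement(conn, match_id):
    # The database multiplies in place; no row ever reaches Python.
    conn.execute(
        "UPDATE participant SET some_field = some_field * 10 "
        "WHERE match_id = ?",
        (match_id,),
    )
    return 1  # number of statements issued


def field_values(conn):
    return [v for (v,) in
            conn.execute("SELECT some_field FROM participant ORDER BY id")]


slow, fast = make_db(), make_db()
print(update_row_by_row(slow, 1), update_in_one_statement(fast, 1))
print(field_values(slow) == field_values(fast))
```

Both approaches leave the table in the same state; the single statement simply avoids one round trip per row.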
