
Django prefetch_related and performance optimisation with multiple QuerySets within loops

Effectively, I have multiple queries inside loops that I am just not happy with. I was wondering whether someone with expertise in prefetch_related and other Django query-construction optimisation could help me with this issue.

I start with:

users = User.objects.filter(organisation=organisation).filter(is_active=True)

Then, I start my loop over all days, starting from a certain date, start_date:

for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):

Within this loop, I then loop over a filtered subset of the above users:

for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
  for user in users.filter(created_date__lte=date).iterator():

Ok, so firstly, is there any way to optimise this?

What may make some of the hardcore Django-ers lose their tether: I do all of the above inside yet another loop!

for survey in Survey.objects.all().iterator():
   for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
      for user in users.filter(created_date__lte=date).iterator():

Inside the last loop, I perform one final Query filter:

survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]

I do this because I feel I need to have the user, survey and date variables ready to filter...

I have started thinking about prefetch_related and the Prefetch object. I've consulted the documentation but I can't seem to wrap my head around how I could apply this to my situation.
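For reference, the general shape of a Prefetch would be something like the sketch below (the surveyresult_set reverse name is an assumption based on Django's default naming, and it isn't obvious how the per-date filtering would fold into it):

from django.db.models import Prefetch

# Hypothetical sketch: 'surveyresult_set' assumes the default reverse name
# for the SurveyResult -> User foreign key.
users_with_results = users.prefetch_related(
    Prefetch(
        'surveyresult_set',
        queryset=SurveyResult.objects.order_by('-updated_date'),
        to_attr='ordered_results',
    )
)
# Each user then carries user.ordered_results in memory instead of
# triggering a separate SurveyResult query per user.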

Effectively, the query is taking far too long. For an average of 1000 users, 4 surveys and approximately 30 days, this query is taking 1 minute to complete.

Ideally, I would like to shave off 50% of this. Any better, and I will be extremely happy. I'd also like the load on the DB server to be reduced as this query could be running multiple times across different organisations.

Any expert tips on how to organise such horrific queries within loops within loops would be greatly appreciated!

Full "condensed" minimum viable snippet:


users = User.objects.filter(organisation=organisation).filter(is_active=True)

datasets = []

for survey in Survey.objects.all():
    data = []
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        total_score = 0
        participants = 0

        for user in users.filter(created_date__lte=date):
            participants += 1
            survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
            total_score += survey_result.score

        # An average is calculated from total_score and participants and appended to the data list.
        # divide() catches divide-by-zero errors.
        # round() rounds to two decimal places for front-end readability.
        data.append(
            round(
                divide(total_score, participants), 2
            )
        )

    datasets.append(data)
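For context, divide above is a custom helper rather than anything from Django or the standard library; a minimal sketch of such a helper, assuming it simply returns 0 when there are no participants:

def divide(numerator, denominator):
    # Assumed helper: return 0 instead of raising ZeroDivisionError
    # when there are no participants on a given day.
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return 0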

********* ADDENDUM: *********

So, further to @dirkgroten's answer I am currently running with:

for survey in Survey.objects.all():

    results = SurveyResult.objects.filter(
        user__in=users, survey=survey, created_date__range=date_range
    ).values(
        'survey',
        'created_date',
    ).annotate(
        total_score=Sum('normalized_score'),
        participants=Count('user'),
        average_score=Avg('normalized_score'),
    ).order_by(
        'created_date'
    )

    for result in results:
        print(result)

As I (" think I ") need a breakdown by survey for each QuerySet.

Are there any other optimisations available to me?

You can actually combine queries and perform the calculations directly inside your query:

from django.db.models import Sum, Count, Avg
from django.utils import timezone

users = User.objects.filter(organisation=organisation).filter(is_active=True)
date_range = [start_date, timezone.now().date()]  # or adapt the end time to a different time zone
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
              .values('survey', 'created_date')\
              .annotate(total_score=Sum('score'), participants=Count('pk'))\
              .order_by('survey', 'created_date')

This will group the results by survey and created_date and add the total_score and participants to each result, something like:

[{'survey': 1, 'created_date': '2019-08-05', 'total_score': 54, 'participants': 20}, 
 {'survey': 1, ... } ... ]

I'm assuming there's only one SurveyResult per user so the number of SurveyResult in each group is the number of participants.
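If that assumption doesn't hold and a user can have several results in the same survey/date group, a distinct count is one way to keep the participant figure accurate. This is a hedged variation on the query above rather than part of the original answer:

# Variation, assuming a user may have more than one SurveyResult per group:
# count each user only once when computing participants.
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
              .values('survey', 'created_date')\
              .annotate(total_score=Sum('score'), participants=Count('user', distinct=True))\
              .order_by('survey', 'created_date')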

Note that Avg also gives you the average score directly, assuming there is only one possible score per user:

.annotate(average_score=Avg('score'))  # instead of total and participants

This should shave off 99.9% of your query time :-)

If you want the dataset as a list of lists, you just do something like this:

dataset = []
data = []
current_survey = None
current_date = start_date
for result in results:
    if not result['survey'] == current_survey:
        # results are ordered by survey, so when it changes, save and reset data
        if data: dataset.append(data)
        data = []
        current_survey = result['survey']
        current_date = start_date  # restart the date sequence for the new survey
    if not result['created_date'] == current_date:
        # results are ordered by date, so a missing date won't turn up later;
        # assume a daterange function yielding the dates to pad (sketched below)
        for date in daterange(current_date, result['created_date']):
            data.append(0)  # padding data for days with no results
    data.append(result['average_score'])
    current_date = result['created_date'] + datetime.timedelta(days=1)  # next expected date
if data: dataset.append(data)  # don't forget the last survey's data
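The loop assumes a daterange helper, which isn't defined in the answer; a minimal sketch of one, yielding each date from start (inclusive) up to end (exclusive) so it lines up with the padding above:

import datetime

def daterange(start, end):
    # Assumed helper: yield each date from start (inclusive) up to
    # end (exclusive); the loop above pads each yielded day with 0.
    for n in range((end - start).days):
        yield start + datetime.timedelta(days=n)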

The result will be a list of lists:

dataset = [[0, 0, 10.4, 3.9, 0], [20.2, 3.5, ...], ...]

Not super efficient Python, but with a few thousand results this will be fast anyway, and far faster than performing more DB queries.

EDIT: Since created_date is a DateTimeField, you first need to get the corresponding date:

from django.db.models.functions import TruncDate

results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
               .annotate(date=TruncDate('created_date'))\
               .values('survey', 'date')\
               .annotate(average_score=Avg('score'))\
               .order_by('survey', 'date')
