How to optimize this function for returning posts in a feed. Django

Question

Hello I made a function that will return posts based on when they were created and ranked by the number of votes they have. This is for a feed. I'm not sure that if this is an efficient way to do this and it's important that I get it right because this could potentially be sifting through thousands of posts. I have simplified it some here and briefly explained each step. My current concerns and a post of the whole function (without annotations) are after.

The start hours argument is for how many hours ago to get posts. For example, if startHours=6, than only posts created after six hours ago will be returned.

def rank(request, startHours):

First I order all of the posts by votes

    unorderedPosts=Post.objects.order_by('-votes')

Then posts are excluded by categories the user has specified

    if request.user.is_authenticated(): 
        preferences=request.user.categorypreference_set.filter(on=False)
        for preference in preferences:
                unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)

Then I make endHours, which is always 4 hours ahead of startHours argument

    endHours=startHours+4         #4 hour time window

Now I slice the unorderedPosts and get only the ones created in the time window startHours to endHours from now. For example, if startHours=4, than only posts created after 4 hours ago but before 8 hours ago will be returned.

    posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))

    posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))

Now I make a loop that will move the time window back until at least one post is found that has a creation date that fits the time window. I make the check variable to prevent an infinite loop (the loop will quit if no posts are found within 200 hours).

    count=posts.count()
    check=endHours
    while count<1 and endHours<(check+200):
        endHours+=4
        startHours+=4
        posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
        posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
        count=posts.count()
        if count>=1: return posts, endHours

    return posts

My biggest concern is making a queryset of ALL the posts in the beginning. This function is meant to return posts in small time windows, will it be unnecessarily slowed down by ranking all of the posts in the database? I know that django/python querysets are quite efficient but won't ranking a set that contains thousands of objects be cumbersome for the purpose of this function?

If this is a problem, how could I make it more efficient while keeping everything accessible?

Here is a post of the whole thing.

def rank(request, startHours): 

    unorderedPosts=Post.objects.order_by('-upVote')

    if request.user.is_authenticated(): 
        preferences=request.user.categorypreference_set.filter(on=False)
        for preference in preferences: #filter by category preference
                unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)


    endHours=startHours+4     #4 hour time window

    posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))

    posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))

    count=posts.count()
    check=endHours

    while count<1 and endHours<(check+200):
        endHours+=4
        startHours+=4
        posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
        posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
        count=posts.count()
        if count>=1: return posts

   return posts

Answer 1

Your main concern is not something you need to worry about. Check the docs on When querysets are evaluated - you can define and add clauses to a queryset indefinitely and it won't actually be run against the database until you call something that actually requires hitting the database.

What will require multiple queries is iterating through time until you hit a window that has posts. You'll have better performance if you check the latest created time in one call, use that to work out your window, then limit the posts based on that and then sort by number of votes.

Something like:

unorderedPosts = Post.objects.all()
if request.user.is_authenticated(): 
    preferences=request.user.categorypreference_set.filter(on=False)
    for preference in preferences: #filter by category preference
        unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)
latest_post_datetime = unorderedPosts.aggregate(Max('created'))['created__max']

original_start_time = datetime.datetime.now() - datetime.timedelta(hours=startHours)    
latest_post_day_start_time = datetime.datetime.combine(latest_post_datetime.date(), original_start_time.time())
# a timedelta guaranteed to be less than 24 hours
time_shift = latest_post_day_start_time - latest_post_datetime
timewindow = datetime.timedelta(hours=4)
if time_shift.days >= 0:
    extra_windows_needed = time_shift.seconds / timewindow.seconds 
else:
    # negative timedeltas store negative days, then positive seconds; negate
    extra_windows_needed = -(abs(time_shift).seconds) / timewindow.seconds
start_time = latest_post_day_start_time - (timewindow * (extra_windows_needed + 1))
posts = unorderedPosts.filter(created__gte=start_time).order_by('-upVote')
return posts

The math here is only right as long as your number of hours in your window (4) divides evenly into the day - otherwise calculating the correct offset gets trickier. Basically, you need to take time offset mod time window length, and I'm exploiting the fact that if you end up in the same calendar day I know the days mod four hours part works out.

Also, it doesn't include an end time because your original logic doesn't enforce one for the initial startHours period. It'll only move the start time back out of that if there are none within it, so you don't need to worry about stuff that's too recent showing up.

This version makes at most three DB queries - one to get the category preferences for a logged in user, one to get latest_post_datetime and one to get posts , with confidence of having at least one matching post.

You might also consider profiling to see if your DB back end does better with a subquery to exclude unwanted categories:

if request.user.is_authenticated():
    unorderedPosts = unorderedPosts.exclude(category_name__in=request.user.categorypreference_set.filter(on=False).values_list('category_name')

As the __in lookup docs note, performance here varies with different database back ends.

How to optimize this function for returning posts in a feed. Django

Question

1 answers

solution1
1 ACCPTED 2013-07-10 22:43:26

How to optimize this function for returning posts in a feed. Django

Question

1 answers

solution1 1 ACCPTED 2013-07-10 22:43:26

solution1
1 ACCPTED 2013-07-10 22:43:26