
How to optimize this function for returning posts in a feed. Django

Hello, I made a function that returns posts based on when they were created, ranked by the number of votes they have. This is for a feed. I'm not sure if this is an efficient way to do it, and it's important that I get it right because it could potentially be sifting through thousands of posts. I have simplified it some here and briefly explained each step. My current concerns and a post of the whole function (without annotations) follow.

The startHours argument is how many hours ago to get posts from. For example, if startHours=6, then only posts created more than six hours ago will be considered.

def rank(request, startHours): 

First I order all of the posts by votes

    unorderedPosts=Post.objects.order_by('-votes')

Then posts are excluded by categories the user has specified

    if request.user.is_authenticated(): 
        preferences=request.user.categorypreference_set.filter(on=False)
        for preference in preferences:
                unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)

Then I make endHours, which is always 4 hours ahead of the startHours argument

    endHours=startHours+4         #4 hour time window

Now I narrow unorderedPosts down to only the posts created in the window between endHours and startHours ago. For example, if startHours=4, then only posts created more than 4 hours ago but less than 8 hours ago will be returned.

    posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))

    posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
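
To make the window concrete, the two calls above amount to a single range condition - a rough sketch using the same variables (and assuming created is never null):

    now = timezone.now()
    posts = unorderedPosts.filter(
        created__gte=now - datetime.timedelta(hours=endHours),   # no older than endHours ago
        created__lt=now - datetime.timedelta(hours=startHours),  # at least startHours old
    )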

Now I make a loop that moves the time window back until at least one post is found whose creation date fits the window. I use the check variable to prevent an infinite loop (the loop will quit if no posts are found within 200 hours).

    count=posts.count()
    check=endHours
    while count<1 and endHours<(check+200):
        endHours+=4
        startHours+=4
        posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
        posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
        count=posts.count()
        if count>=1: return posts, endHours

    return posts

My biggest concern is making a queryset of ALL the posts at the beginning. This function is meant to return posts in small time windows; will it be unnecessarily slowed down by ranking all of the posts in the database? I know that Django/Python querysets are quite efficient, but won't ranking a set that contains thousands of objects be cumbersome for the purpose of this function?

If this is a problem, how could I make it more efficient while keeping everything accessible?

Here is a post of the whole thing.

def rank(request, startHours): 

    unorderedPosts=Post.objects.order_by('-upVote')

    if request.user.is_authenticated(): 
        preferences=request.user.categorypreference_set.filter(on=False)
        for preference in preferences: #filter by category preference
                unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)


    endHours=startHours+4     #4 hour time window

    posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))

    posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))

    count=posts.count()
    check=endHours

    while count<1 and endHours<(check+200):
        endHours+=4
        startHours+=4
        posts=unorderedPosts.filter(created__gte=(timezone.now()-datetime.timedelta(hours=endHours)))
        posts=posts.exclude(created__gte=(timezone.now()-datetime.timedelta(hours=startHours)))
        count=posts.count()
        if count>=1: return posts

    return posts

Your main concern is not something you need to worry about. Check the docs on when querysets are evaluated - you can define and add clauses to a queryset indefinitely, and it won't actually be run against the database until you call something that actually requires hitting the database.
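
For example, using the Post model from the question (the 'spam' category name is just a placeholder), none of these lines hits the database until the final count():

qs = Post.objects.order_by('-votes')             # builds a queryset, no query yet
qs = qs.exclude(category_name='spam')            # still no query
qs = qs.filter(created__gte=timezone.now() - datetime.timedelta(hours=8))
qs.count()                                       # the database is hit here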

What will require multiple queries is iterating through time until you hit a window that has posts. You'll get better performance if you check the latest created time in one call, use that to work out your window, then limit the posts based on that and sort by number of votes.

Something like:

from django.db.models import Max
import datetime

unorderedPosts = Post.objects.all()
if request.user.is_authenticated(): 
    preferences=request.user.categorypreference_set.filter(on=False)
    for preference in preferences: #filter by category preference
        unorderedPosts=unorderedPosts.exclude(category_name=preference.category_name)
latest_post_datetime = unorderedPosts.aggregate(Max('created'))['created__max']

original_start_time = datetime.datetime.now() - datetime.timedelta(hours=startHours)    
latest_post_day_start_time = datetime.datetime.combine(latest_post_datetime.date(), original_start_time.time())
# a timedelta guaranteed to be less than 24 hours
time_shift = latest_post_day_start_time - latest_post_datetime
timewindow = datetime.timedelta(hours=4)
if time_shift.days >= 0:
    extra_windows_needed = time_shift.seconds // timewindow.seconds  # floor division: whole windows
else:
    # negative timedeltas store negative days, then positive seconds; negate
    extra_windows_needed = -(abs(time_shift).seconds) // timewindow.seconds
start_time = latest_post_day_start_time - (timewindow * (extra_windows_needed + 1))
posts = unorderedPosts.filter(created__gte=start_time).order_by('-upVote')
return posts

The math here is only right as long as the number of hours in your window (4) divides evenly into the day - otherwise calculating the correct offset gets trickier. Basically, you need to take the time offset mod the time window length, and I'm exploiting the fact that if you end up in the same calendar day, the days-mod-four-hours part works out.
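
If you ever need a window length that doesn't divide the day evenly, a more general sketch (not part of the code above) is to work in seconds and take the ceiling of the offset over the window length:

import math

def window_start(anchor, latest_post_datetime, window=datetime.timedelta(hours=4)):
    # Start of the window, counting back from `anchor` in `window`-sized steps,
    # that contains latest_post_datetime. Assumes latest_post_datetime <= anchor,
    # matching the original logic where matching posts are older than startHours.
    offset = anchor - latest_post_datetime
    windows_back = max(1, int(math.ceil(offset.total_seconds() / window.total_seconds())))
    return anchor - window * windows_back

With original_start_time as the anchor, start_time = window_start(original_start_time, latest_post_datetime) would replace the calendar-day arithmetic above.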

Also, it doesn't include an end time, because your original logic doesn't enforce one for the initial startHours period. It'll only move the start time back out of that window if there are no posts within it, so you don't need to worry about stuff that's too recent showing up.

This version makes at most three DB queries - one to get the category preferences for a logged-in user, one to get latest_post_datetime, and one to get posts, with confidence of having at least one matching post.
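
If you want to verify the query count, Django's test utilities can capture it - a sketch, assuming the snippet above is wrapped in a hypothetical helper called get_feed(request, startHours):

from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    posts = list(get_feed(request, startHours=0))  # list() forces evaluation
print(len(ctx.captured_queries))                   # expect at most 3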

You might also consider profiling to see if your DB back end does better with a subquery to exclude unwanted categories:

if request.user.is_authenticated():
    unorderedPosts = unorderedPosts.exclude(category_name__in=request.user.categorypreference_set.filter(on=False).values_list('category_name'))

As the __in lookup docs note, performance here varies with different database back ends.
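
Before timing anything, you can also print the SQL Django generates for each approach (str(qs.query) shows an approximation of the query that will run):

loop_qs = Post.objects.all()
for preference in request.user.categorypreference_set.filter(on=False):
    loop_qs = loop_qs.exclude(category_name=preference.category_name)

subquery_qs = Post.objects.exclude(
    category_name__in=request.user.categorypreference_set.filter(
        on=False).values_list('category_name', flat=True))

print(loop_qs.query)      # one NOT condition per excluded category
print(subquery_qs.query)  # typically a single NOT IN subquery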
