Implementing a popularity algorithm in Django

Question

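I'm creating a site similar to reddit and Hacker News that has a database of links and votes. I'm implementing Hacker News' popularity algorithm, and things were going smoothly until it came to actually gathering up these links and displaying them. The algorithm is simple: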

Y Combinator's Hacker News:
Popularity = (p - 1) / (t + 2)^1.5

Votes divided by age factor.
Where

p : votes (points) from users.
t : time since submission in hours.

p is subtracted by 1 to negate submitter's vote.
Age factor is (time since submission in hours plus two) to the power of 1.5.
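As a quick sanity check of the formula above, here is a minimal plain-Python sketch. The function name `popularity` and its parameter names are chosen only for illustration; they are not part of the models in this question.

```python
def popularity(points, age_hours):
    """Score as described above: (p - 1) / (t + 2) ** 1.5."""
    # Subtract 1 to cancel the submitter's own vote, then divide by the age factor.
    return (points - 1) / (age_hours + 2) ** 1.5


fresh = popularity(11, 2)    # 10 / 4 ** 1.5
stale = popularity(11, 14)   # same votes, twelve hours older
print(fresh, stale)
```

With the same number of votes, the older link always scores lower, which is the decay the age factor is meant to provide.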

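I asked a very similar question earlier, "Complex ordering in Django", but instead of contemplating my options I chose one and tried to make it work, because that's how I did it with PHP/MySQL. I now know that Django does things quite differently.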

My models look like this (exactly):

class Link(models.Model):
    category = models.ForeignKey(Category)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    fame = models.PositiveIntegerField(default = 1)
    title = models.CharField(max_length = 256)
    url = models.URLField(max_length = 2048)

    def __unicode__(self):
        return self.title

class Vote(models.Model):
    link = models.ForeignKey(Link)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    karma_delta = models.SmallIntegerField()

    def __unicode__(self):
        return str(self.karma_delta)

And my view:

def index(request):
    popular_links = Link.objects.select_related().annotate(karma_total = Sum('vote__karma_delta'))
    return render_to_response('links/index.html', {'links': popular_links})

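Following on from my previous question, I'm trying to implement the algorithm using the sorting functionality. The answers to that question seem to suggest putting the algorithm in the select and sorting on it. I'm going to paginate these results, so I don't think I can do the sorting in Python without fetching everything. Any suggestions on how to do this efficiently?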

Edit

This isn't working yet, but I think it's a step in the right direction:

from django.shortcuts import render_to_response
from linkett.apps.links.models import *

def index(request):
    popular_links = Link.objects.select_related()
    popular_links = popular_links.extra(
        select = {
            'karma_total': 'SUM(vote.karma_delta)',
            'popularity': '(karma_total - 1) / POW(2, 1.5)',
        },
        order_by = ['-popularity']
    )
    return render_to_response('links/index.html', {'links': popular_links})

This fails with:

Caught an exception while rendering: column "karma_total" does not exist
LINE 1: SELECT ((karma_total - 1) / POW(2, 1.5)) AS "popularity", (S...

Edit 2

A better error?

TemplateSyntaxError: Caught an exception while rendering: missing FROM-clause entry for table "vote"
LINE 1: SELECT ((vote.karma_total - 1) / POW(2, 1.5)) AS "popularity...

My index.html is simple:

{% block content %}

{% for link in links %}
 
  
   karma-up
   {{ link.karma_total }}
   karma-down
  
  {{ link.title }}
  Posted by {{ link.user }} to {{ link.category }} at {{ link.created }}
 
{% empty %}
 No Links
{% endfor %}

{% endblock content %}

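Edit 3

So very close! Again, all of these answers are great, but I'm concentrating on one particular answer because I feel it works best for my situation.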


from django.db.models import Sum
from django.shortcuts import render_to_response
from linkett.apps.links.models import *

def index(request):
    popular_links = Link.objects.select_related().extra(
        select = {
            'popularity': '(SUM(links_vote.karma_delta) - 1) / POW(2, 1.5)',
        },
        tables = ['links_link', 'links_vote'],
        order_by = ['-popularity'],
    )
    return render_to_response('links/test.html', {'links': popular_links})

Running this gives me an error because I'm missing a GROUP BY. Specifically:


TemplateSyntaxError at /
Caught an exception while rendering: column "links_link.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...karma_delta) - 1) / POW(2, 1.5)) AS "popularity", "links_lin...

I'm not sure why links_link.id isn't in my GROUP BY, and I don't know how to alter the GROUP BY myself; Django usually handles that.
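For reference, the shape of SQL that avoids this error joins the two tables and groups by the link columns that are selected. The sketch below is only an illustration: it runs against an in-memory SQLite database through Python's `sqlite3` module rather than through Django and PostgreSQL, the table and column names mirror `links_link` and `links_vote` from the error message, and the sample rows, the `ranked_page` helper, and the registered `POW` function are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Register POW explicitly, since not every SQLite build ships math functions.
conn.create_function("POW", 2, pow)
conn.executescript("""
CREATE TABLE links_link (id INTEGER PRIMARY KEY, title TEXT, created TEXT);
CREATE TABLE links_vote (id INTEGER PRIMARY KEY, link_id INTEGER, karma_delta INTEGER);
""")
conn.executemany("INSERT INTO links_link VALUES (?, ?, ?)", [
    (1, "old but loved", "2009-12-27 00:00:00"),
    (2, "fresh", "2009-12-27 10:00:00"),
    (3, "no votes yet", "2009-12-27 11:00:00"),
])
votes = [(1,)] * 20 + [(2,)] * 5   # 20 upvotes for link 1, 5 for link 2
conn.executemany("INSERT INTO links_vote (link_id, karma_delta) VALUES (?, 1)", votes)

RANKING_SQL = """
SELECT l.id, l.title,
       (COALESCE(SUM(v.karma_delta), 0) - 1)
       / POW((julianday(:now) - julianday(l.created)) * 24 + 2, 1.5) AS popularity
FROM links_link AS l
LEFT JOIN links_vote AS v ON v.link_id = l.id
GROUP BY l.id, l.title, l.created      -- every non-aggregated column is grouped
ORDER BY popularity DESC
LIMIT :limit OFFSET :offset            -- pagination happens in the database
"""

def ranked_page(now, limit, offset=0):
    """Return one page of (id, title, popularity) rows, most popular first."""
    params = {"now": now, "limit": limit, "offset": offset}
    return conn.execute(RANKING_SQL, params).fetchall()

page = ranked_page("2009-12-27 12:00:00", limit=2)
print([row[0] for row in page])   # link ids on the first page
```

Because the ordering and the LIMIT/OFFSET both happen in the query, only one page of rows is ever fetched, which is the property the pagination requirement needs.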

Answer 1

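On Hacker News, only the 210 newest stories and the 210 most popular stories are paginated (7 pages of 30 stories each). My guess is that this problem is at least part of the reason for the limit.

Why not drop all the fancy SQL for the most popular stories and just keep a running list instead? Once you've established a list of the top 210 stories, you only need to worry about reordering when a new vote comes in, since relative order is maintained over time. And when a new vote does come in, you only need to worry about reordering the story that received the vote.

If the story that received the vote is not on the list, calculate its score along with the score of the least popular story that is on the list. If the voted story's score is lower, you're done. If it's higher, calculate the current score of the second-least popular story (story 209) and compare again. Keep working up the list until you find a story with a higher score, then place the newly voted story right below it in the rankings. Unless, of course, it reaches #1.

The benefit of this approach is that it limits the set of stories you have to look at to work out the top-stories list. In the absolute worst case, you have to calculate the ranking for 211 stories. So it's very efficient unless you have to build the list from an existing data set, but that's just a one-time penalty, assuming you cache the list somewhere.

Downvoting is another issue, but I can only upvote (at my karma level, anyway).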
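The procedure described above could be sketched roughly as follows. This is a plain-Python illustration, not Hacker News' actual implementation: the class name `RunningTopList`, the `on_vote` method, and the `score_of` callback (which returns a story's current score) are all hypothetical, and, like the description, the sketch handles upvotes only.

```python
class RunningTopList:
    """Keeps story ids ordered from most to least popular, capped at `capacity`."""

    def __init__(self, capacity=210):
        self.capacity = capacity
        self.stories = []

    def on_vote(self, story_id, score_of):
        """Reposition `story_id` after it received a vote."""
        # Take the story out of its old slot, if it was on the list at all.
        if story_id in self.stories:
            self.stories.remove(story_id)
        new_score = score_of(story_id)
        # Walk up from the bottom until a story with a higher (or equal) score is found.
        pos = len(self.stories)
        while pos > 0 and score_of(self.stories[pos - 1]) < new_score:
            pos -= 1
        # Place the voted story right below that one (or at #1 if none was found).
        self.stories.insert(pos, story_id)
        # Drop whatever fell off the end of the list.
        del self.stories[self.capacity:]


scores = {"a": 5.0, "b": 3.0, "c": 1.0}
top = RunningTopList(capacity=2)
for story in ("a", "b", "c"):
    top.on_vote(story, scores.get)
scores["c"] = 4.0              # "c" gets a vote and now outranks "b"
top.on_vote("c", scores.get)
print(top.stories)
```

Only the stories that are actually compared have their scores recomputed, so the work per vote is bounded by the list size plus one, matching the worst case described above.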

Answer 2

popular_links = Link.objects.select_related()
popular_links = popular_links.extra(
    select = {
        'karma_total': 'SUM(vote.karma_delta)',
        'popularity': '(karma_total - 1) / POW(2, 1.5)'
    },
    order_by = ['-popularity']
)

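Alternatively, select some reasonable number of rows, sort the selection in Python however you like, and cache the result if it's going to be the same for all users (which it looks like it will be), setting the cache expiration to a minute or so.

But extra will work better for paginated results in a highly dynamic setup.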
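To make the "sort in Python and cache for a minute" idea concrete, here is a minimal sketch of a time-based cache in plain Python. It deliberately does not use Django's cache framework; the `TimedCache` class and its injectable `clock` parameter are hypothetical names used only for this illustration.

```python
import time


class TimedCache:
    """Holds one computed value and recomputes it after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be simulated
        self._value = None
        self._expires_at = None

    def get(self, compute):
        now = self.clock()
        # Recompute on first use or once the cached value has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self.ttl
        return self._value


fake_now = [0.0]
calls = []

def compute_ranking():
    calls.append(1)                       # count how often the sort really runs
    return sorted([3, 1, 2], reverse=True)

cache = TimedCache(60, clock=lambda: fake_now[0])
first = cache.get(compute_ranking)
second = cache.get(compute_ranking)       # served from the cache
fake_now[0] = 61.0
third = cache.get(compute_ranking)        # expired, so it is recomputed
print(first, len(calls))
```

In a real project the same effect would normally come from the framework's own cache with a timeout of about 60 seconds; the point here is only that the expensive sort runs once per expiry window rather than once per request.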

Answer 3

It seems like you could override save on the Vote class and have it update the corresponding Link object. Something like this should work well:

from datetime import datetime

class Link(models.Model):
    category = models.ForeignKey(Category)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    fame = models.PositiveIntegerField(default = 1)
    title = models.CharField(max_length = 256)
    url = models.URLField(max_length = 2048)

    # a field to keep the most recently calculated popularity
    # (the default is 0.0 rather than None, since the column is not nullable)
    popularity = models.FloatField(default = 0.0)

    def CalculatePopularity(self):
        """
        Add a shortcut to make life easier ... this is used by the overridden save() method and
        can be used in a management command to do a mass update periodically
        """
        if self.pk is None:
            # a brand-new link has no votes and no creation time yet
            self.popularity = 0.0
            return
        age = datetime.now() - self.created
        # total_seconds() includes whole days; .seconds alone would drop them
        hours = age.total_seconds() / 3600.0
        # vote_set is the reverse relation created by Vote.link
        self.popularity = (self.vote_set.count() - 1) / ((hours + 2) ** 1.5)

    def save(self, *args, **kwargs):
        """
        Modify the save function to calculate the popularity
        """
        self.CalculatePopularity()
        super(Link, self).save(*args, **kwargs)

    def __unicode__(self):
        return self.title

class Vote(models.Model):
    link = models.ForeignKey(Link)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    karma_delta = models.SmallIntegerField()

    def save(self, *args, **kwargs):
        """
        Save the vote, then recalculate and store the popularity of its Link
        """
        super(Vote, self).save(*args, **kwargs)
        # Link.save() recalculates the popularity and writes it to the database
        self.link.save()

    def __unicode__(self):
        return str(self.karma_delta)

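This way, every time you call link_o.save() or vote_o.save(), the popularity is recalculated. You have to be careful, because calling Link.objects.all().update('updating something') will not call our overridden save() method. So when I use this sort of thing, I create a management command that updates all of the objects so they don't get too stale. Something like this will work well: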

# save() recalculates the popularity and writes it back to the database;
# iterator() streams the rows instead of caching the whole queryset
for link in Link.objects.all().iterator():
    link.save()

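This way only one Link object is loaded into memory at a time, so if you have a huge database it won't cause a memory error.

Now, to do your ranking, all you have to do is: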

Link.objects.all().order_by('-popularity')

It will be very fast, since every Link already has its popularity calculated.

Answer 4

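Here is the final answer to my question, although it's many months late and not exactly what I had in mind. Hopefully it will be useful to some.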

def hot(request):
    links = Link.objects.select_related().annotate(votes=Count('vote')).order_by('-created')[:150]
    for link in links:
        # total_seconds() avoids the non-portable strftime("%s") round trip
        delta_in_hours = (datetime.now() - link.created).total_seconds() / 3600
        link.popularity = ((link.votes - 1) / (delta_in_hours + 2)**1.5)

    links = sorted(links, key=lambda x: x.popularity, reverse=True)

    links = paginate(request, links, 5)

    return direct_to_template(
        request,
        template = 'links/link_list.html',
        extra_context = {
            'links': links
        })

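What's happening here is that I pull the latest 150 submissions (5 pages of 30 links each); if you need more, you can change my slice [:150]. This way I don't have to iterate over my whole queryset, which might eventually become really large, and really, 150 links should be plenty of procrastination for anybody.

I then calculate the time difference between now and when the link was created, and convert it to hours (not nearly as easy as I thought).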
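One portable way to do that conversion is `timedelta.total_seconds()`. The short sketch below shows it next to the common pitfall of using `timedelta.seconds`, which silently drops whole days; the helper name `hours_between` is made up for this example.

```python
from datetime import datetime


def hours_between(now, created):
    """Elapsed time from `created` to `now`, in (fractional) hours."""
    return (now - created).total_seconds() / 3600.0


now = datetime(2009, 12, 27, 12, 0)
created = datetime(2009, 12, 25, 9, 0)      # 2 days and 3 hours earlier

print(hours_between(now, created))          # counts the two full days
print((now - created).seconds // 3600)      # .seconds only sees the 3-hour remainder
```

For links older than a day the two results diverge, so an age computed from `.seconds` alone would make old links look freshly submitted.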

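I apply the algorithm to a field that doesn't exist (I like this approach because I don't have to store the value in my database and it doesn't rely on the surrounding links).

The line right after the for loop is where I had another bit of trouble. I can't order_by('popularity') because it isn't a real field in my database and is calculated on the fly, so I have to convert my queryset to a list of objects and sort by popularity from there.

The next line is just my paginator shortcut; thankfully, pagination doesn't require a queryset, unlike some generic views (talking to you, object_list).

Spit everything out into a nice direct_to_template generic view and be on my merry way.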
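The sort-then-paginate step works on any plain list, which the following sketch illustrates without Django. The `page_of` helper and the sample `(title, popularity)` pairs are hypothetical; they stand in for the `paginate` shortcut and the Link objects in the view.

```python
def page_of(items, page, per_page):
    """Return the 1-based `page` of `items`, `per_page` entries at a time."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


# (title, popularity) pairs standing in for links with a computed attribute
links = [("a", 0.36), ("b", 0.50), ("c", 0.0)]

# Sort on the computed value, highest first, since the database cannot do it.
ranked = sorted(links, key=lambda pair: pair[1], reverse=True)

print(page_of(ranked, 1, 2))   # first page
print(page_of(ranked, 2, 2))   # second (last) page
```

Sorting in Python like this is only reasonable because the candidate set was capped beforehand (the [:150] slice in the view); the whole list is in memory either way.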

Answer 1: 9 (accepted), 2009-12-27 07:45:45
Answer 2: 4, 2009-12-27 08:41:45
Answer 3: 4, 2009-12-27 18:37:29
Answer 4: 1, 2010-05-11 00:46:22