Popularity Algorithm

I'd like to populate the homepage of my user-submitted-illustrations site with the "hottest" illustrations uploaded.

Here are the measures I have available:

  • How many people have favourited that illustration
    • votes table includes date voted
  • When the illustration was uploaded
    • illustration table has date created
  • Number of comments (not as useful, since the maximum comment count is about 10 at the moment)
    • comments table has comment date

I have searched around, but most algorithms factor in user authority, and I don't want that to play a part.

I also need to find out whether it's better to do the calculation in the MySQL query that fetches the data, or to run a PHP/cron job every hour or so.

I only need 20 illustrations to populate the home page. I don't need any sort of paging for this data.

How do I weight age against votes? Surely a site with fewer submissions needs less weight on date added?

Many sites that use some type of popularity ranking do so by using a standard algorithm to determine a score and then decaying it indefinitely over time. What I've found works better for sites with less traffic is a multiplier that gives a bonus to new content/activity - it's essentially the same idea, but the score stops changing after a period of time of your choosing.

For instance, here's a pseudo-example of something you might want to try. Of course, you'll want to adjust how much weight you attribute to each category based on your own experience with your site. Comments are rare, but take more effort from the user than a favorite/vote, so they should probably receive more weight.

score = (votes / 10) + comments
age = UNIX_TIMESTAMP() - UNIX_TIMESTAMP(date_created)   // age in seconds

if (age < 86400) score = score * 1.5                    // 86400 seconds = 1 day

This type of approach gives a bonus to new content uploaded in the past day. If you wanted to apply a similar bonus only to content that had been favorited or commented on recently, you could add some WHERE constraints to the query that pulls the score out of the DB.
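The pseudo-example above can be sketched in Python as follows. This is only an illustration: the function name `bonus_score` and its `bonus` and `window` parameters are assumptions made here, and the weights are the same placeholder values that would need tuning for a real site.

```python
DAY_SECONDS = 86400  # length of the "new content" window used in the pseudo-example


def bonus_score(votes, comments, age_seconds, bonus=1.5, window=DAY_SECONDS):
    """Base score with a fixed multiplier for content newer than `window`.

    Once the content is older than `window`, the score no longer depends on
    age, so it stops changing unless new votes or comments arrive.
    """
    score = votes / 10 + comments  # a comment weighs ten times as much as a vote
    if age_seconds < window:
        score *= bonus
    return score


fresh = bonus_score(votes=20, comments=1, age_seconds=3600)   # one hour old
stale = bonus_score(votes=20, comments=1, age_seconds=90000)  # just over a day old
```

With these inputs the fresh illustration scores 4.5 and the older one 3.0, so identical activity ranks higher while the upload is still inside the bonus window.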

There are actually two big reasons NOT to calculate this ranking on the fly.

  1. Requiring your DB to fetch all of that data and do a calculation on every page load, just to reorder items, results in an expensive query.
  2. Probably a smaller gotcha, but if you have a relatively small amount of activity on the site, small changes in score can cause content to move around the ranking pretty drastically.

That leaves you with either caching the results periodically or setting up a cron job to update a new database column that holds the score you're ranking by.
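A rough sketch of the cron-job option, using Python's built-in sqlite3 as a stand-in for MySQL so that it runs on its own. The `illustrations` table, its columns, and the function names are all hypothetical; on the real site the `votes` and `comments` counts would come from the separate votes and comments tables.

```python
import sqlite3


def refresh_scores(conn, now):
    """Recompute the stored score for every row; intended to run from cron."""
    conn.execute(
        """
        UPDATE illustrations
        SET score = (votes / 10.0 + comments)
                    * CASE WHEN ? - created_at < 86400 THEN 1.5 ELSE 1.0 END
        """,
        (now,),
    )
    conn.commit()


def hottest(conn, limit=20):
    """Cheap read path: the home page only sorts on the precomputed column."""
    rows = conn.execute(
        "SELECT id FROM illustrations ORDER BY score DESC, id LIMIT ?", (limit,)
    )
    return [row[0] for row in rows]


# Tiny in-memory demo with made-up data (created_at is in Unix seconds).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE illustrations (id INTEGER PRIMARY KEY, votes INTEGER, "
    "comments INTEGER, created_at INTEGER, score REAL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO illustrations (id, votes, comments, created_at) VALUES (?, ?, ?, ?)",
    [(1, 30, 0, 0), (2, 20, 1, 990000), (3, 5, 0, 999000)],
)
refresh_scores(conn, now=1_000_000)
print(hottest(conn))  # prints [2, 1, 3]
```

Because the ranking only changes when `refresh_scores` runs, the home page order stays stable between runs, which also softens the second gotcha above.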

Obviously there is some subjectivity in this - there's no one "correct" algorithm for determining the proper balance - but I'd start out with something like votes per unit age. MySQL can do basic math, so you can ask it to sort by the quotient of votes over time; however, for performance reasons, it might be a good idea to cache the result of the query. Maybe something like

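SELECT images.url
FROM images
-- votes per second of age; GREATEST avoids dividing by zero for brand-new images
ORDER BY (SELECT COUNT(*) FROM votes WHERE votes.image_id = images.id)
         / GREATEST(TIMESTAMPDIFF(SECOND, images.date, NOW()), 1) DESC
LIMIT 20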

but my SQL is rusty ;-)
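The same "votes per unit age" ordering, sketched in Python for anyone who would rather compute it in application code and cache the result. The tuple layout, the function names, and the one-hour minimum age are assumptions made for this example, not part of the query above.

```python
from datetime import datetime, timedelta


def votes_per_hour(vote_count, created, now):
    """Votes divided by age in hours, treating anything younger than 1 hour as 1 hour."""
    age_hours = max((now - created).total_seconds() / 3600, 1.0)
    return vote_count / age_hours


def rank(images, now, limit=20):
    """images: iterable of (url, vote_count, created) tuples; best rate first."""
    ordered = sorted(
        images,
        key=lambda image: votes_per_hour(image[1], image[2], now),
        reverse=True,
    )
    return [url for url, _votes, _created in ordered][:limit]


now = datetime(2024, 1, 2)
images = [
    ("a.png", 48, now - timedelta(hours=48)),    # 1 vote per hour
    ("b.png", 10, now - timedelta(hours=2)),     # 5 votes per hour
    ("c.png", 3, now - timedelta(minutes=30)),   # age clamped to 1 hour: 3 per hour
]
top = rank(images, now)
```

Clamping the age keeps an image that is only a few seconds old with a single vote from jumping straight to the top.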

Taking a simple average will, of course, bias in favor of new images showing up on the front page. If you want to remove that bias, you could, say, count only those votes that occurred within a certain time limit after the image was posted. For images younger than that time limit, you'd have to normalize by multiplying the number of votes by the time limit and then dividing by the age of the image. Alternatively, you could give the votes a continuously varying weight, something like exp(-time(vote) + time(image)). And so on and so on... depending on how particular you are about what this algorithm will do, it could take some experimentation to figure out which formula gives the best results.
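Both variations can be sketched in a few lines of Python. The function names are made up for this example, and the time scale `tau` is an added assumption: the exp(...) expression above has no explicit unit, so some scale has to be chosen to make the exponent dimensionless.

```python
import math


def windowed_votes(vote_delays, image_age, window):
    """Count only votes cast within `window` seconds of the upload.

    vote_delays holds, for each vote, the seconds between upload and vote.
    Images younger than the window are scaled up to a full window.
    """
    counted = sum(1 for delay in vote_delays if delay <= window)
    if image_age < window:
        return counted * window / max(image_age, 1)
    return counted


def decayed_votes(vote_delays, tau):
    """Weight each vote by exp(-(time(vote) - time(image)) / tau)."""
    return sum(math.exp(-delay / tau) for delay in vote_delays)


old_image = windowed_votes([10, 100, 5000], image_age=10000, window=1000)
new_image = windowed_votes([10, 100], image_age=500, window=1000)
```

Here `old_image` is 2, because the vote at 5000 seconds falls outside the window, and `new_image` is 4.0, because two votes in half a window are extrapolated to a full one.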

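As far as the actual algorithm goes, I don't have any useful ideas, but in terms of implementation, I'd suggest caching the results somewhere and updating them periodically - if computing the results leads to an expensive query, you probably don't want it slowing down your response times.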

Something like:

(count favorited + k) / time since last activity

The higher k is, the less weight the number of people who favorited it carries.

You could also change the time to something like the time since it first appeared + the time since the last activity; this would ensure that older illustrations fade out over time.
