Popularity Algorithm

Question

I'm making a digg-like website that is going to have a homepage with different categories. I want to display the most popular submissions.

Our rating system is simply "likes", like "I like this" and whatnot. We basically want to display the submissions with the highest number of "likes" per time. We want to have three categories: all-time popularity, last week, and last day.

Does anybody know of a way to help? I have no idea how to go about doing this and making it efficient. I thought that we could use some sort of cron-job to run every 10 minutes and pull in the number of likes per the last 10 minutes...but I've been told that's pretty inefficient?

Help?

Thanks!

Answer 1

Typically Digg and Reddit-like sites go by the date of the submission and not the times of the votes. This way all it takes is a simple SQL query to find the top submissions for X time period. Here's a pseudo-query to find the 10 most popular links from the past 24 hours using this method:

select * from submissions
 where (current_time - post_time) < 86400
 order by score desc limit 10

Basically, this query finds all the submissions where the number of seconds between now and the time of posting is less than 86400, which is the number of seconds in 24 hours (this assumes both times are stored as UNIX timestamps).
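To make the pseudo-query above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column names (id, title, post_time, score) and the helper name top_recent are assumptions for illustration, and post_time is assumed to hold a UNIX timestamp. The filter compares post_time against a precomputed cutoff, which is equivalent to (current_time - post_time) < 86400.

```python
import sqlite3
import time

def top_recent(conn, now, window_seconds=86400, limit=10):
    """Return (id, title, score) for posts newer than the window, best score first."""
    cutoff = now - window_seconds  # same test as (now - post_time) < window_seconds
    return conn.execute(
        "SELECT id, title, score FROM submissions "
        "WHERE post_time > ? "
        "ORDER BY score DESC LIMIT ?",
        (cutoff, limit),
    ).fetchall()

# Hypothetical schema and sample rows, only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions ("
    "id INTEGER PRIMARY KEY, title TEXT, post_time INTEGER, score INTEGER)"
)
now = int(time.time())
conn.executemany(
    "INSERT INTO submissions (title, post_time, score) VALUES (?, ?, ?)",
    [
        ("old hit", now - 3 * 86400, 500),  # popular, but posted three days ago
        ("fresh", now - 3600, 12),
        ("fresher", now - 60, 40),
    ],
)
print(top_recent(conn, now))  # [(3, 'fresher', 40), (2, 'fresh', 12)]
```

Passing a different window_seconds (for example 7 * 86400) gives the "last week" list from the same function.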

If you really want to measure popularity within X time interval, you'll need to store the post and time for every vote in another table:

create table votes (
 post integer references submissions(id),
 time datetime,
 vote integer); -- +1 for upvote, -1 for downvote

Then you can generate a list of the most popular posts between X and Y times like so:

select sum(vote), post from votes
 where X < time and time < Y
 group by post
 order by sum(vote) desc limit 10;

From here you're just a hop, skip, and inner join away from getting the post data tied to the returned ids.
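As a rough illustration of that last step, the sketch below runs the votes query with the inner join already attached, again using sqlite3. The title column, the integer timestamps, and the function name top_between are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE votes (
    post INTEGER REFERENCES submissions(id),
    time INTEGER,   -- seconds since the epoch (an assumption for this sketch)
    vote INTEGER    -- +1 for upvote, -1 for downvote
);
""")
conn.executemany(
    "INSERT INTO submissions (id, title) VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)
conn.executemany(
    "INSERT INTO votes (post, time, vote) VALUES (?, ?, ?)",
    [
        (1, 100, 1), (1, 150, 1), (1, 160, -1),  # alpha nets +1 in the window
        (2, 120, 1), (2, 130, 1),                # beta nets +2 in the window
        (3, 500, 1),                             # gamma's vote is outside the window
    ],
)

def top_between(conn, start, end, limit=10):
    """Sum the votes cast strictly between start and end, joined to post data."""
    return conn.execute(
        """
        SELECT s.id, s.title, SUM(v.vote) AS score
        FROM votes AS v
        INNER JOIN submissions AS s ON s.id = v.post
        WHERE ? < v.time AND v.time < ?
        GROUP BY s.id, s.title
        ORDER BY score DESC
        LIMIT ?
        """,
        (start, end, limit),
    ).fetchall()

print(top_between(conn, 50, 200))  # [(2, 'beta', 2), (1, 'alpha', 1)]
```

Note that a post with no votes inside the window does not show up at all, which is usually what you want for a "most popular" list.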

Answer 2

Do you have a decent DB setup? Can we please hear about your CREATE TABLE details and indices? Assuming a sane setup, the DB should be able to pull the counts you require fast enough to suit your needs! For example (net of indices and keys, that somewhat depend on what DB engine you're using), given two tables:

CREATE TABLE submissions (subid INT, when DATETIME, etc etc)
CREATE TABLE likes (subid INT, when DATETIME, etc etc)

you can get the top 33 all-time popular submissions as

SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33

and those voted for within a certain time range as

SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
WHERE likes.when BETWEEN initial_time AND final_time
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33

If you were storing "votes" (positive or negative) in likes, instead of just counting each entry there as +1, you could simply use SUM(likes.vote) instead of the COUNTs.
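For the three lists the question asks about (all-time, last week, last day), the same query can be reused with an optional lower bound on the like time. This is only a sketch under some assumptions: it runs on sqlite3, the timestamp columns are renamed to posted_at and liked_at (WHEN is a keyword in many SQL dialects), times are UNIX timestamps, and top_liked is a hypothetical helper name.

```python
import sqlite3

DAY = 86400

def top_liked(conn, since=None, limit=33):
    """Count likes per submission; if since is given, only likes at or after it."""
    where, params = "", []
    if since is not None:
        where = "WHERE likes.liked_at >= ?"
        params.append(since)
    params.append(limit)
    return conn.execute(
        f"""
        SELECT submissions.subid, COUNT(*) AS score
        FROM submissions
        JOIN likes ON likes.subid = submissions.subid
        {where}
        GROUP BY submissions.subid
        ORDER BY score DESC
        LIMIT ?
        """,
        params,
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (subid INTEGER PRIMARY KEY, posted_at INTEGER);
CREATE TABLE likes (subid INTEGER, liked_at INTEGER);
""")
now = 100 * DAY  # fixed "current time" so the sample data is reproducible
conn.executemany(
    "INSERT INTO submissions (subid, posted_at) VALUES (?, ?)",
    [(1, now - 40 * DAY), (2, now - 5 * DAY), (3, now - DAY // 2)],
)
conn.executemany(
    "INSERT INTO likes (subid, liked_at) VALUES (?, ?)",
    [(1, now - 30 * DAY)] * 4      # four old likes
    + [(2, now - 3 * DAY)] * 2     # two likes this week
    + [(3, now - 3600)],           # one like in the last hour
)

print(top_liked(conn))                       # all-time:  [(1, 4), (2, 2), (3, 1)]
print(top_liked(conn, since=now - 7 * DAY))  # last week: [(2, 2), (3, 1)]
print(top_liked(conn, since=now - DAY))      # last day:  [(3, 1)]
```

Which indices help here depends on the DB engine, as noted above; an index on the like timestamp is the usual first thing to try for the time-limited variants.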

Answer 3

For stable lists like all-time and last week, which are not supposed to change very fast, I think you should keep the list in your cache with an expiration time of around one day or longer.

If you are concerned about correct counts in real time, you can check on every page view by comparing the viewed submission with the lowest-ranked entry in the cache.

All you need to take care of is keeping the cache synchronized with the actual database.
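A minimal sketch of the expiring-cache idea, in plain Python with no external cache server. The class name TTLCache, the key "all_time", and the loader function are hypothetical; in a real setup the loader would run the expensive database query, and the store would more likely be something like memcached.

```python
import time

class TTLCache:
    """Keep one computed value per key and recompute it after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock       # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: serve the cached list
        value = compute()        # missing or expired: rebuild from the database
        self._entries[key] = (now + self.ttl, value)
        return value

# Demonstration with a fake clock and a stand-in for the slow query.
fake_now = [0.0]
calls = []

def load_all_time():
    calls.append(1)              # count how often the "database" is hit
    return ["post-7", "post-3"]

cache = TTLCache(ttl_seconds=86400, clock=lambda: fake_now[0])
first = cache.get("all_time", load_all_time)
second = cache.get("all_time", load_all_time)  # served from the cache
fake_now[0] = 86401                            # a bit more than one day later
third = cache.get("all_time", load_all_time)   # expired, so recomputed
print(len(calls))  # 2
```

The "last day" list would probably want a much shorter expiration time than the all-time list, since it changes faster.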

thethanghn

Answer 4

Queries where the order is some function of the current time can become real performance problems. Things get much simpler if you can bucket by calendar time and update scores for each bucket as people vote.
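One way to read the bucketing suggestion, sketched in memory with Python: keep a like counter per post per calendar day, bump it when a vote comes in, and build a ranking by adding up the last N daily buckets. The class and method names are made up for this example, and note that a calendar-day bucket only approximates a rolling 24-hour window.

```python
from collections import Counter, defaultdict
from datetime import date, datetime, timedelta

class DailyBuckets:
    """Per-day like counters that are updated as people vote."""

    def __init__(self):
        self._buckets = defaultdict(Counter)  # date -> Counter(post_id -> likes)

    def record_like(self, post_id, when):
        # Only the calendar day of the vote is kept, not the exact time.
        self._buckets[when.date()][post_id] += 1

    def top(self, today, days, limit=10):
        """Add up the buckets for today and the days - 1 days before it."""
        total = Counter()
        for offset in range(days):
            day = today - timedelta(days=offset)
            total.update(self._buckets.get(day, Counter()))
        return total.most_common(limit)

buckets = DailyBuckets()
today = date(2009, 6, 22)
buckets.record_like("a", datetime(2009, 6, 22, 9, 0))
buckets.record_like("a", datetime(2009, 6, 22, 10, 0))
buckets.record_like("b", datetime(2009, 6, 22, 11, 0))
for _ in range(5):
    buckets.record_like("b", datetime(2009, 6, 18, 12, 0))
buckets.record_like("c", datetime(2009, 6, 1, 12, 0))

print(buckets.top(today, days=1))  # [('a', 2), ('b', 1)]
print(buckets.top(today, days=7))  # [('b', 6), ('a', 2)]
```

The ranking query then touches at most one small bucket per day instead of scanning every vote, and none of the stored values depend on the current time.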

Answer 5

To complete nobody_'s answer, I would suggest you read up on the documentation (if you are using MySQL, of course).

5 answers

solution1
9 2009-06-22 04:27:42

solution2
3 2009-06-22 04:34:43

solution3
0 2009-06-22 04:41:14

solution4
0 2009-06-22 22:12:13

solution5
-1 2009-06-22 04:31:19
