
Optimizing queries for content popularity by hits


I've done some searching for this but haven't come up with anything; maybe someone could point me in the right direction.
I have a website with lots of content in a MySQL database and a PHP script that loads the most popular content by hits. It does this by logging each content hit in a table along with the access time. Then a select query is run to find the most popular content in the past 24 hours, 7 days, or at most 30 days. A cronjob deletes anything older than 30 days from the log table.

The problem I'm facing now is that as the website grows, the log table has 1m+ hit records and it is really slowing down my select query (10-20s). At first I thought the problem was a join I had in the query to get the content title, url, etc. But now I'm not sure, as in testing, removing the join does not speed up the query as much as I thought it would.

So my question is: what is the best practice for this kind of popularity storing/selecting? Are there any good open source scripts for this? Or what would you suggest?

Table schema

"popularity" hit log table
nid | insert_time | tid
nid: Node ID of the content
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term/category ID

"node" content table
nid | title | status | (there are more but these are the important ones)
nid: Node ID
title: content title
status: is the content published (0=false, 1=true)

SQL

SELECT node.nid, node.title, COUNT(popularity.nid) AS count  
FROM `node` INNER JOIN `popularity` USING (nid)  
WHERE node.status = 1  
  AND  popularity.insert_time >= DATE_SUB(CURDATE(),INTERVAL 7 DAY)  
GROUP BY popularity.nid  
ORDER BY count DESC  
LIMIT 10;

We've just come across a similar situation and this is how we got around it. We decided we didn't really care about the exact 'time' something happened, only the day it happened on. We then did this:

  1. Every record has a 'total hits' record which is incremented every time something happens
  2. A logs table records these 'total hits' per record, per day (in a cron job)
  3. By selecting the difference between two given dates in this log table, we can deduce the 'hits' between two dates, very quickly.

The advantage of this is that the size of your log table is only NumRecords * NumDays, which in our case is very small. Also any queries on this logs table are very quick (a sketch of the approach follows below).

The disadvantage is you lose the ability to deduce hits by time of day but if you don't need this then it might be worth considering.
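A minimal sketch of that approach, using a hypothetical daily_hits snapshot table and the nid naming from the question:

-- one row per node per day, written by the cron job
CREATE TABLE daily_hits (
  nid        INT  NOT NULL,
  snap_date  DATE NOT NULL,
  total_hits INT  NOT NULL,   -- running total as of that day
  PRIMARY KEY (nid, snap_date)
);

-- hits over the last 7 days = difference between two snapshots
SELECT d2.nid, d2.total_hits - d1.total_hits AS hits
FROM daily_hits d2
JOIN daily_hits d1 USING (nid)
WHERE d2.snap_date = CURDATE()
  AND d1.snap_date = DATE_SUB(CURDATE(), INTERVAL 7 DAY)
ORDER BY hits DESC
LIMIT 10;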

You actually have two problems to solve further down the road.

One, which you've yet to run into but might hit earlier than you'd like, is insert throughput within your stats table.

The other, which you've outlined in your question, is actually using the stats.


Let's start with insert throughput.

Firstly, in case you're doing so, don't track statistics directly on pages that could otherwise be cached. Instead, use a PHP script that advertises itself as an empty javascript file, or as a one-pixel image, and include it on the pages you're tracking. Doing so lets you readily cache the remaining content of your site.

In the telco business, rather than doing actual inserts for billing on each phone call, records are placed in memory and periodically synced to disk. Doing so makes it possible to handle gigantic throughput while keeping the hard drives happy.

To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for doing the first part...

For each page, you need a Memcache variable. In Memcache, increment() is atomic, but add(), set(), and so forth aren't. So you need to be wary of mis-counting hits when concurrent processes add the same page at the same time:

$ns = $memcache->get('stats-namespace');
while (!$memcache->increment("stats-$ns-$page_id")) {
  // increment() fails while the key doesn't exist; add() only creates it if no
  // concurrent process already did, so an existing count is never reset, and
  // the loop then retries the increment
  $memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
  $db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
}

Periodically, say every 5 minutes (configure the timeout accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or existing hit counts. For this, you increment the namespace before doing anything (this gives you a lock on existing data for all intents and purposes), and sleep a bit so that existing processes that reference the prior namespace finish up if needed:

$ns = $memcache->get('stats-namespace');
$memcache->increment('stats-namespace');
sleep(60); // allow concurrent page loads to finish

Once that is done, you can safely loop through your page ids, update stats accordingly, and clean up the needs_stats_refresh table. The latter only needs two fields (page_id int pkey, ns_id int). There's a bit more to it than simple select, insert, update and delete statements run from your scripts, however, so continuing...
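Spelled out, that bookkeeping table (as described above, with the MEMORY engine from the upsert comment) is simply:

-- (namespace, page) pairs whose memcache counters still need flushing;
-- the MEMORY engine keeps these high-frequency writes off the disk
CREATE TABLE needs_stats_refresh (
  page_id INT NOT NULL PRIMARY KEY,
  ns_id   INT NOT NULL
) ENGINE = MEMORY;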

As another replier suggested, it's quite appropriate to maintain intermediate stats for your purpose: store batches of hits rather than individual hits. At the very most, I'm assuming you want hourly or quarter-hourly stats, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.

Even more importantly for your sake, since you're ordering posts using these totals, you want to store the aggregated totals and have an index on the latter. (We'll get to where to store them further down.)

One way to maintain the totals is to add a trigger which, on insert or update to the stats table, will adjust the stats total as needed.
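A minimal sketch of such a trigger, assuming a hypothetical stats table with (page_id, subtotal) columns and the stat_totals table described further down; only the insert path and the weekly total are shown, the other periods and the update path follow the same pattern:

-- each batch of hits loaded into stats immediately bumps the running total
CREATE TRIGGER stats_after_insert
AFTER INSERT ON stats
FOR EACH ROW
  INSERT INTO stat_totals (page_id, weekly_total)
  VALUES (NEW.page_id, NEW.subtotal)
  ON DUPLICATE KEY UPDATE weekly_total = weekly_total + NEW.subtotal;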

When doing so, be especially wary about deadlocks. While no two $ns runs will be mixing their respective stats, there is still a (however slim) possibility that two or more processes fire up the "increment $ns" step described above concurrently, and subsequently issue statements that seek to update the counts concurrently. Obtaining an advisory lock is the simplest, safest, and fastest way to avoid problems related to this.

Assuming you use an advisory lock, it's perfectly OK to use total = total + subtotal in the update statement.
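In MySQL, an advisory lock is available through GET_LOCK()/RELEASE_LOCK(); a minimal sketch, where the lock name and the literal values are placeholders:

-- serialize the totals refresh behind a named advisory lock
SELECT GET_LOCK('stat_totals_refresh', 10);   -- wait at most 10 seconds

UPDATE stat_totals
SET weekly_total = weekly_total + 42          -- 42 stands in for the batch subtotal
WHERE page_id = 123;                          -- 123 is a placeholder page id

SELECT RELEASE_LOCK('stat_totals_refresh');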

While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them processed all in one go because it might mean keeping an exclusive lock for an extended duration. The simplest here is to process the inserts into stats in smaller batches (say, 1000), each followed by a commit.

For intermediary stats (monthly, weekly), add a few boolean fields (bit or tinyint in MySQL) to your stats table. Have each of these store whether the row is still to be counted toward the monthly, weekly, daily stats, etc. Place a trigger on them as well, so that setting or clearing them increases or decreases the applicable totals in your stat_totals table.
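For instance (column names are made up here), on the batched stats table that could look like:

-- a periodic job clears a flag once a row ages out of the corresponding
-- window, and the update trigger subtracts its subtotal from that total
ALTER TABLE stats
  ADD COLUMN count_daily   TINYINT(1) NOT NULL DEFAULT 1,
  ADD COLUMN count_weekly  TINYINT(1) NOT NULL DEFAULT 1,
  ADD COLUMN count_monthly TINYINT(1) NOT NULL DEFAULT 1;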

As a closing note, give some thought to where you want the actual count to be stored. It needs to be an indexed field, and the latter is going to be heavily updated. Typically, you'll want it stored in its own table, rather than in the pages table, in order to avoid cluttering your pages table with (much larger) dead rows.
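For example, a separate totals table along those lines might look like this (names are illustrative; weekly_total is the column the final query below orders by):

-- heavily-updated counters live in their own narrow table, keeping the
-- pages table itself free of the resulting churn
CREATE TABLE stat_totals (
  page_id       INT NOT NULL PRIMARY KEY,
  daily_total   INT NOT NULL DEFAULT 0,
  weekly_total  INT NOT NULL DEFAULT 0,
  monthly_total INT NOT NULL DEFAULT 0,
  KEY idx_weekly_total (weekly_total)
);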


Assuming you did all the above, your final query becomes:

select p.*
from pages p join stat_totals s using (page_id)
order by s.weekly_total desc limit 10

It should be plenty fast with the index on weekly_total.

Lastly, let's not forget the most obvious of all: if you're running these same total/monthly/weekly/etc queries over and over, their result should be placed in memcache too.

You can add indexes and try tweaking your SQL, but the real solution here is to cache the results.

You should really only need to calculate the last 7/30 days of traffic once daily,

and you could do the past 24 hours hourly?

even if you did it once every 5 minutes, that's still a huge savings over running the (expensive) query for every hit of every user.
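One way to do that (table name and layout are only an illustration) is to have the cron job materialise the expensive query into a small cache table that page views read instead:

-- rebuilt by cron; page views only ever read this tiny table
CREATE TABLE popular_cache (
  period VARCHAR(10) NOT NULL,   -- '24h', '7d' or '30d'
  nid    INT         NOT NULL,
  hits   INT         NOT NULL,
  PRIMARY KEY (period, nid)
);

-- refresh the 7-day list (run hourly or daily from cron)
DELETE FROM popular_cache WHERE period = '7d';

INSERT INTO popular_cache (period, nid, hits)
SELECT '7d', p.nid, COUNT(*)
FROM popularity p
JOIN node ON node.nid = p.nid AND node.status = 1
WHERE p.insert_time >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY p.nid
ORDER BY COUNT(*) DESC
LIMIT 10;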

RRDtool

Many tools/systems do not build their own logging and log aggregation but use RRDtool (round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.

From your question I assume you don't need any particularly fancy analysis, so RRDtool would efficiently do what you need without you having to implement and tune your own system.

You can do some 'aggregation' in the background, for example with a cron job. Some suggestions (in no particular order) that might help:

1. Create a table with hourly results. This means you can still create the statistics you want, but you reduce the amount of data to roughly 24 * 7 * 4 = 672 records per page per month.

Your table could be something along these lines:

hourly_results (
nid integer,
start_time datetime,
amount integer
)

After you parse the raw hit records into your aggregate table, you can more or less delete them.
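A sketch of that roll-up using the popularity table from the question (the cut-off handling is kept deliberately simple):

-- roll the raw hit log up into hourly buckets (run from the cron job)
INSERT INTO hourly_results (nid, start_time, amount)
SELECT nid,
       DATE_FORMAT(insert_time, '%Y-%m-%d %H:00:00') AS start_time,
       COUNT(*)
FROM popularity
WHERE insert_time < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')   -- complete hours only
GROUP BY nid, DATE_FORMAT(insert_time, '%Y-%m-%d %H:00:00');

-- the raw rows that have just been aggregated can then be dropped
DELETE FROM popularity
WHERE insert_time < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00');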

2. Use result caching (memcache, APC). You can easily store the results (which should not change every minute, but rather every hour?), either in memcache (which again you can update from a cronjob), in the APC user cache (which you can't update from a cronjob), or with file caching by serializing objects/results if you're short on memory.

3. Optimize your database. 10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?
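For example (this particular index is my suggestion, not part of the answer above), a composite index on the hit log would let the date filter and the grouping in the question's query be resolved from the index instead of scanning 1m+ rows:

ALTER TABLE popularity ADD INDEX idx_insert_time_nid (insert_time, nid);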
