
SQL count many-to-many values or have it counted every time new row is added?

I am using MySQL (MyISAM) 5.0.41 and I have this query:

SELECT `x`.`items`.id, `x`.`items`.name, COUNT(*) AS count
    FROM `x`.`items` INNER JOIN `x`.`user_items`
    ON `x`.`items`.id = `x`.`user_items`.item_id
    GROUP BY name HAVING count > 2 ORDER BY count DESC

I have about 36,000 users, 175,000 user_items and 60,000 items which are constantly added to. So this query is getting a bit slow...

Is it better to:

  • Have a count field in items and update that periodically (say each time a user adds an item)
  • or run the query like this (slowly)..

Or is there any SQL that will populate the count field for me?

Thanks

You can use an intermediate solution:

  • Add a ts DATETIME column to the user_items table which would describe the time the user added the item

  • Add a ts DATETIME column to the users table which would describe the point of actuality, as well as cnt, the cached count column

  • Periodically update the users table with the new count and timestamp:

     INSERT INTO users (id, ts, cnt)
     SELECT * FROM (
         SELECT user_id, NOW() AS nts, COUNT(*) AS ncnt
         FROM user_items ui
         WHERE ui.ts <= NOW()
         GROUP BY user_id
     ) q
     ON DUPLICATE KEY UPDATE ts = nts, cnt = ncnt
  • Invalidate the user's timestamp when a user_items entry is deleted

  • Issue this query to count the items:

     SELECT u.id, u.cnt + ( SELECT COUNT(*) FROM user_items ui WHERE ui.ts > u.ts AND ui.user_id = u.id ) AS cnt FROM users u

This way, only the newly added items will be counted in the user_items table which is much faster, and you won't have concurrency issues with updating the records too often.
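The scheme above can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of MySQL, integer timestamps instead of DATETIME, and a plain UPDATE for the refresh step instead of INSERT ... ON DUPLICATE KEY UPDATE. The function names refresh_cache and current_counts and the index name are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY,
                        ts INTEGER NOT NULL DEFAULT 0,   -- point of actuality
                        cnt INTEGER NOT NULL DEFAULT 0); -- cached count
    CREATE TABLE user_items (user_id INTEGER, item_id INTEGER, ts INTEGER);
    -- Hypothetical index so the "newer than ts" lookup stays cheap.
    CREATE INDEX idx_user_items_user_ts ON user_items (user_id, ts);
    INSERT INTO users (id) VALUES (1), (2);
""")

def refresh_cache(conn, now):
    """Periodic job: recompute every cached count as of `now`."""
    with conn:
        conn.execute("""
            UPDATE users
            SET cnt = (SELECT COUNT(*) FROM user_items ui
                       WHERE ui.user_id = users.id AND ui.ts <= :now),
                ts = :now
        """, {"now": now})

def current_counts(conn):
    """Cached count plus only the rows added since the last refresh."""
    return conn.execute("""
        SELECT u.id,
               u.cnt + (SELECT COUNT(*) FROM user_items ui
                        WHERE ui.user_id = u.id AND ui.ts > u.ts)
        FROM users u
        ORDER BY u.id
    """).fetchall()

conn.executemany("INSERT INTO user_items VALUES (?, ?, ?)",
                 [(1, 100, 1), (1, 101, 2), (2, 100, 3)])
refresh_cache(conn, now=5)
# Added after the refresh, so it is picked up by the delta subquery only.
conn.execute("INSERT INTO user_items VALUES (1, 102, 6)")

cached = conn.execute("SELECT id, cnt FROM users ORDER BY id").fetchall()
live = current_counts(conn)
print(cached, live)
```

Deletes are not handled here; as the answer says, a delete would need to invalidate the user's timestamp so the next read or refresh recounts from scratch.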

You should start by indexing user_items.item_id and grouping on it instead of name. Strings are much slower to group by (try it out for yourself), and the index should speed things up a bit more. If that still is too slow, you could run the GROUP BY query first and then join on the items table if your DBMS execution plan isn't doing that by default.
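A small sketch of that suggestion, using Python's sqlite3 module as a stand-in for MySQL so it can be run as is. The sample rows and the index name idx_user_items_item_id are invented for illustration. Note that grouping by item_id gives the same result as grouping by name only if item names are unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_items (user_id INTEGER, item_id INTEGER);
    -- Index the join/grouping column.
    CREATE INDEX idx_user_items_item_id ON user_items (item_id);
    INSERT INTO items VALUES (1, 'apple'), (2, 'pear'), (3, 'plum');
    INSERT INTO user_items VALUES
        (1, 1), (2, 1), (3, 1), (4, 1),
        (1, 2), (2, 2), (3, 2),
        (1, 3);
""")

# Aggregate on the integer key first, then join the small result to items.
POPULAR_ITEMS = """
    SELECT i.id, i.name, c.cnt
    FROM (
        SELECT item_id, COUNT(*) AS cnt
        FROM user_items
        GROUP BY item_id
        HAVING COUNT(*) > 2
    ) AS c
    JOIN items AS i ON i.id = c.item_id
    ORDER BY c.cnt DESC
"""

rows = conn.execute(POPULAR_ITEMS).fetchall()
print(rows)
```

On the real database, running EXPLAIN on both forms of the query is the way to check whether the rewrite actually changes the plan.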

That query is pretty much doing a full table scan every time. There is no way around that. Indexes will speed things up by speeding up the join, but the query will just get slower and slower as your data grows.

Storing summary data, like the "count" with the "items", would be the way to go. You can do this with stored procedures or through code. As a double check, you can periodically (say, once per day) update all counts so you know they are accurate.
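Here is a rough sketch of the "through code" variant together with the periodic recount, which is also one answer to "is there any SQL that will populate the count field for me?". It runs against sqlite3 rather than MySQL, and the column name item_count and the helpers add_user_item and recount_all are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT,
                        item_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE user_items (user_id INTEGER, item_id INTEGER);
    INSERT INTO items (id, name) VALUES (1, 'apple'), (2, 'pear');
""")

def add_user_item(conn, user_id, item_id):
    """Insert the link row and bump the stored count in one transaction."""
    with conn:
        conn.execute("INSERT INTO user_items VALUES (?, ?)", (user_id, item_id))
        conn.execute("UPDATE items SET item_count = item_count + 1 WHERE id = ?",
                     (item_id,))

def recount_all(conn):
    """Periodic double check: rebuild every count from user_items."""
    with conn:
        conn.execute("""
            UPDATE items
            SET item_count = (SELECT COUNT(*) FROM user_items ui
                              WHERE ui.item_id = items.id)
        """)

add_user_item(conn, 10, 1)
add_user_item(conn, 11, 1)
add_user_item(conn, 10, 2)
# Simulate drift: a row written without going through add_user_item.
conn.execute("INSERT INTO user_items VALUES (12, 1)")

before = dict(conn.execute("SELECT id, item_count FROM items"))
recount_all(conn)
after = dict(conn.execute("SELECT id, item_count FROM items"))
print(before, after)
```

With the count stored on items, the original report reduces to a plain SELECT with WHERE item_count > 2 ORDER BY item_count DESC.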

My impulse would be to leave the data in something like normal form (in other words, do not increment a "count" field), and then cache the result of the slow query at the application level.
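A minimal sketch of what application-level caching could look like, in Python. The class name CachedQuery, the 60-second TTL and the stand-in slow_popular_items function are all hypothetical; in a real application the callable would run the GROUP BY query against the database.

```python
import time

class CachedQuery:
    """Cache the result of an expensive zero-argument callable for ttl_seconds."""

    def __init__(self, run_query, ttl_seconds, clock=time.monotonic):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing is cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._run_query()
            self._expires_at = now + self._ttl
        return self._value

# Stand-in for the slow query; counts how often it really runs.
calls = {"n": 0}

def slow_popular_items():
    calls["n"] += 1
    return [(1, "apple", 4), (2, "pear", 3)]

# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = CachedQuery(slow_popular_items, ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get()    # runs the query
second = cache.get()   # served from the cache
fake_now[0] = 61.0
third = cache.get()    # TTL expired, runs the query again
print(calls["n"])
```

The trade-off is staleness: readers may see counts up to one TTL old, which is usually acceptable for a "most popular items" list.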

If caching is ineffective, because many people are doing the query and few of them do it twice, then, yes, you can set up a stored procedure that automatically updates some row in some table. The details vary depending on DB vendor. Here's how to do it in PostgreSQL. This is the only safe way to do it (i.e., within the DB, and not from the application layer) due to race conditions.
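One way to keep the count inside the database is a pair of triggers. The sketch below uses SQLite trigger syntax through Python's sqlite3 module so it can be run directly; MySQL has supported triggers since 5.0.2, but its syntax differs (FOR EACH ROW is required, for instance), so treat this as an illustration rather than something to paste in. The column name item_count and the trigger names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT,
                        item_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE user_items (user_id INTEGER, item_id INTEGER);

    -- Keep items.item_count in step with inserts into user_items.
    CREATE TRIGGER user_items_after_insert AFTER INSERT ON user_items
    BEGIN
        UPDATE items SET item_count = item_count + 1 WHERE id = NEW.item_id;
    END;

    -- And with deletes.
    CREATE TRIGGER user_items_after_delete AFTER DELETE ON user_items
    BEGIN
        UPDATE items SET item_count = item_count - 1 WHERE id = OLD.item_id;
    END;

    INSERT INTO items (id, name) VALUES (1, 'apple');
""")

# The application only touches user_items; the triggers do the counting.
conn.execute("INSERT INTO user_items VALUES (10, 1)")
conn.execute("INSERT INTO user_items VALUES (11, 1)")
conn.execute("INSERT INTO user_items VALUES (12, 1)")
conn.execute("DELETE FROM user_items WHERE user_id = 11")

item_count = conn.execute(
    "SELECT item_count FROM items WHERE id = 1").fetchone()[0]
print(item_count)
```

An update that moves a row from one item_id to another is not covered here and would need a third trigger.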

Are you really getting all 36,000 users every time that you run your query? If you're looking to find the source of a performance issue then that could be it right there.

Depending on your RDBMS you could look at things like indexed or materialized views. Including the count as part of the table and trying to maintain it will almost certainly be a mistake, especially with the small size of your database.

