SQL count many-to-many values or have it counted every time new row is added?

I am using MySQL (MyISAM) 5.0.41 and I have this query:

SELECT `x`.`items`.id, `x`.`items`.name, COUNT(*) AS count
    FROM `x`.`items` INNER JOIN `x`.`user_items`
    ON `x`.`items`.id = `x`.`user_items`.item_id
    GROUP BY name HAVING count > 2 ORDER BY count DESC

I have about 36,000 users, 175,000 user_items and 60,000 items, and rows are constantly being added. So this query is getting a bit slow...

Is it better to:

  • Have a count field in items and update it regularly (say, each time a user adds an item)
  • or run the query like this (slowly)?

Or is there any SQL that will populate the count field for me?

Thanks

You can use an intermediate solution:

  • Add a ts DATETIME column to the user_items table that records the time the user added the item

  • Add a ts DATETIME column to the users table that records the point in time up to which the cached count is accurate, as well as cnt, the cached count column

  • Periodically update the users table with the new count and timestamp:

     INSERT INTO users (id, ts, cnt)
     SELECT user_id, nts, ncnt
     FROM (
         SELECT ui.user_id, NOW() AS nts, COUNT(*) AS ncnt
         FROM user_items ui
         WHERE ui.ts <= NOW()
         GROUP BY ui.user_id
     ) q
     ON DUPLICATE KEY UPDATE ts = VALUES(ts), cnt = VALUES(cnt)

  • Invalidate the user's timestamp when a user_items entry is deleted

  • Issue this query to count the items:

     SELECT u.id, u.cnt + ( SELECT COUNT(*) FROM user_items ui WHERE ui.ts > u.ts AND ui.user_id = u.id ) AS cnt FROM users u

This way, only the newly added rows in the user_items table have to be counted, which is much faster, and you avoid the concurrency problems that come from updating the cached records too often.
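The schema changes this answer relies on could be sketched as follows. The column names ts and cnt come from the answer; the index name, the column types other than DATETIME, and the explicit default dates are assumptions made for illustration. Because existing rows all receive the same default timestamp, the periodic refresh statement would need to run once right after these changes before the counting query gives correct totals.

```sql
-- Sketch only: the default values and the index name are assumptions.
-- MySQL 5.0 cannot default a DATETIME column to the current time,
-- so the application has to set ts itself when it inserts a row.
ALTER TABLE user_items
    ADD COLUMN ts DATETIME NOT NULL DEFAULT '1970-01-01 00:00:00';

ALTER TABLE users
    ADD COLUMN ts DATETIME NOT NULL DEFAULT '1970-01-01 00:00:00',
    ADD COLUMN cnt INT UNSIGNED NOT NULL DEFAULT 0;

-- Lets the per-user "rows newer than the cached timestamp" subquery
-- be answered from the index instead of scanning user_items.
CREATE INDEX idx_user_items_user_ts ON user_items (user_id, ts);
```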

You should start by indexing user_items.item_id and grouping on it instead of name. Strings are much slower to group by (try it out for yourself), and the index should speed things up a bit more. If that is still too slow, you could run the GROUP BY query first and then join the result to the items table, if your DBMS execution plan isn't already doing that by default.
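A minimal sketch of that suggestion against the tables from the question. The index name is made up, and the rewrite assumes item names are unique; if two items share a name, grouping by item_id returns them as separate rows where the original query merged them.

```sql
-- Assumed index name; supports both the join and the grouping on item_id.
CREATE INDEX idx_user_items_item_id ON `x`.`user_items` (item_id);

-- Aggregate on the integer key first, then join the much smaller
-- result to items to pick up the name.
SELECT i.id, i.name, c.cnt AS count
FROM (
    SELECT item_id, COUNT(*) AS cnt
    FROM `x`.`user_items`
    GROUP BY item_id
    HAVING COUNT(*) > 2
) AS c
INNER JOIN `x`.`items` AS i ON i.id = c.item_id
ORDER BY c.cnt DESC;
```

Running EXPLAIN on both versions should show whether the new index is actually being used.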

That query is pretty much doing a full table scan every time. There is no way around that. Indexes will speed things up by speeding up the join, but the query will just get slower and slower as your data grows.

Storing summary data, like the "count" with the "items", would be the way to go. You can do this with stored procedures or through code. As a double check, you can periodically (e.g., once per day) recalculate all counts so you know they are accurate.
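One way the stored count and the periodic double check might look in MySQL. The column name item_count is hypothetical; the question only calls it "a count field".

```sql
-- One-time schema change; item_count is a hypothetical column name.
ALTER TABLE `x`.`items`
    ADD COLUMN item_count INT UNSIGNED NOT NULL DEFAULT 0;

-- Full recount, e.g. from a nightly job, to correct any drift
-- in the incrementally maintained values.
UPDATE `x`.`items`
SET item_count = (
    SELECT COUNT(*)
    FROM `x`.`user_items` AS ui
    WHERE ui.item_id = `x`.`items`.id
);

-- The report from the question then reads the stored value
-- instead of joining and grouping.
SELECT id, name, item_count
FROM `x`.`items`
WHERE item_count > 2
ORDER BY item_count DESC;
```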

My impulse would be to leave the data in something like normal form (in other words, do not increment a "count" field), and then cache the result of the slow query at the application level.

If caching is ineffective, because many people are doing the query and few of them do it twice, then, yes, you can set up a stored procedure that automatically updates some row in some table. The details vary depending on DB vendor. Here's how to do it in PostgreSQL. This is the only safe way to do it (i.e., within the DB, and not from the application layer) due to race conditions.
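For the MySQL setup in the question, the in-database update could be sketched with triggers, which MySQL 5.0 supports. This assumes `x` is the default database and that items already has an item_count column (a hypothetical name); the trigger names are also made up. It does not handle an UPDATE that moves a user_items row to a different item_id.

```sql
-- Assumes: USE `x`; has been issued and items.item_count exists.
-- Single-statement bodies, so no DELIMITER change is needed.
CREATE TRIGGER user_items_after_insert
AFTER INSERT ON user_items
FOR EACH ROW
    UPDATE items SET item_count = item_count + 1 WHERE id = NEW.item_id;

CREATE TRIGGER user_items_after_delete
AFTER DELETE ON user_items
FOR EACH ROW
    UPDATE items SET item_count = item_count - 1 WHERE id = OLD.item_id;
```

MyISAM tables are not transactional, so an occasional full recount is still a reasonable safeguard against the counter drifting.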

Are you really getting all 36,000 users every time that you run your query? If you're looking to find the source of a performance issue then that could be it right there.

Depending on your RDBMS you could look at things like indexed or materialized views. Including the count as part of the table and trying to maintain it will almost certainly be a mistake, especially with the small size of your database.
