
Is there a better way of doing this in MySQL? - update entire column with another select and group by

I have a table sample with two columns, id and cnt, and another table PostTags with two columns, postid and tagid.

I want to update all cnt values with their corresponding counts, and I have written the following query:

UPDATE sample SET
cnt = (SELECT COUNT(tagid) 
       FROM PostTags 
       WHERE sample.postid = PostTags.postid 
       GROUP BY PostTags.postid)

I intend to update the entire column at once, and this seems to accomplish it. But performance-wise, is this the best way? Or is there a better way?

EDIT

I've been running this query (without GROUP BY) for over 1 hour on ~18 million records. I'm looking for a query that performs better.

Remove the unnecessary GROUP BY and the statement looks good. If, however, you expect many sample.cnt values to already be correct, then you would update many records that need no update. This may create some overhead (larger rollback segments, triggers executed, etc.) and thus take longer.

To update only the records that need updating, use a join:
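One side effect of dropping the GROUP BY is worth knowing: for a post with no tags, the grouped subquery returns no row at all (so cnt becomes NULL), while the ungrouped COUNT always returns one row (so cnt becomes 0). The following is a minimal sketch of that difference, assuming Python's built-in sqlite3 module as a stand-in for MySQL and made-up sample rows; the helper counts_after is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows below are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (postid INTEGER PRIMARY KEY, cnt INTEGER);
    CREATE TABLE PostTags (postid INTEGER, tagid INTEGER);
    INSERT INTO sample (postid) VALUES (1), (2);   -- post 2 has no tags
    INSERT INTO PostTags VALUES (1, 10), (1, 11);
""")

def counts_after(subquery):
    """Reset cnt, run the UPDATE with the given scalar subquery,
    and return the resulting {postid: cnt} mapping."""
    conn.execute("UPDATE sample SET cnt = NULL")
    conn.execute(f"UPDATE sample SET cnt = ({subquery})")
    return dict(conn.execute("SELECT postid, cnt FROM sample"))

base = "SELECT COUNT(tagid) FROM PostTags WHERE sample.postid = PostTags.postid"
with_group_by = counts_after(base + " GROUP BY PostTags.postid")
without_group_by = counts_after(base)

print(with_group_by)     # post 2 is NULL (None): the grouped subquery is empty
print(without_group_by)  # post 2 is 0: plain COUNT always yields one row
```

So removing the GROUP BY is not only cosmetic: untagged posts end up with 0 rather than NULL, which is usually what a counter column wants.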

UPDATE sample
INNER JOIN 
(
  SELECT postid, COUNT(tagid) as cnt
  FROM PostTags 
  GROUP BY postid
) tags ON tags.postid = sample.postid
SET sample.cnt = tags.cnt
WHERE sample.cnt != tags.cnt OR sample.cnt IS NULL;
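To see what the filter buys, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for MySQL. It is an emulation under stated assumptions, not the statement above: the multi-table UPDATE ... JOIN syntax is MySQL-specific, so the derived table is replaced by a temporary table named tags and the join by an EXISTS condition. The data is made up.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows below are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (postid INTEGER PRIMARY KEY, cnt INTEGER);
    CREATE TABLE PostTags (postid INTEGER, tagid INTEGER);
    -- post 1 is already correct, post 2 is stale,
    -- post 3 is unset, post 4 has no tags at all
    INSERT INTO sample VALUES (1, 2), (2, 5), (3, NULL), (4, 7);
    INSERT INTO PostTags VALUES (1, 10), (1, 11), (2, 10), (3, 12);
    -- Pre-aggregated counts, playing the role of the derived table "tags"
    CREATE TEMP TABLE tags AS
        SELECT postid, COUNT(tagid) AS cnt FROM PostTags GROUP BY postid;
""")

before = conn.total_changes
conn.execute("""
    UPDATE sample
    SET cnt = (SELECT tags.cnt FROM tags WHERE tags.postid = sample.postid)
    WHERE EXISTS (
        SELECT 1 FROM tags
        WHERE tags.postid = sample.postid
          AND (sample.cnt != tags.cnt OR sample.cnt IS NULL)
    )
""")
rows_changed = conn.total_changes - before  # rows the UPDATE actually touched
final_counts = dict(conn.execute("SELECT postid, cnt FROM sample"))
print(rows_changed, final_counts)
```

Only the stale and unset rows are written. Note that post 4, which has no tags, keeps its old value: like the INNER JOIN version, this form never touches rows without a match in PostTags, so it will not reset them to 0.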

Here is the SQL fiddle: http://sqlfiddle.com/#!2/d5e88 .

That query should not take an hour. I just did a test, running a query like yours on a table of 87520 keywords and matching rows in a many-to-many table of 2776445 movie_keyword rows. In my test, it took 32 seconds.

The crucial part that you're probably missing is that you must have an index on the lookup column, which is PostTags.postid in your example.

Here's the EXPLAIN from my test (finally we can do EXPLAIN on UPDATE statements in MySQL 5.6):
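As a small illustration of checking that such an index is actually picked up, here is a sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The index name idx_posttags_postid is hypothetical, and SQLite's EXPLAIN QUERY PLAN output differs in format from MySQL's EXPLAIN; the CREATE INDEX statement itself uses syntax that MySQL also accepts.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the index name is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PostTags (postid INTEGER, tagid INTEGER);
    CREATE INDEX idx_posttags_postid ON PostTags (postid);
""")

# Ask the planner how it would run the per-post lookup used by the subquery.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM PostTags WHERE postid = ?", (1,)
).fetchall()

# The last column of each plan row is a human-readable description.
details = [row[-1] for row in plan]
uses_index = any("idx_posttags_postid" in d for d in details)
print(details, uses_index)
```

Without an index on the lookup column, every row of the outer table triggers a scan of the whole many-to-many table, which is the likely reason for an hour-long run on ~18 million records.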

mysql> explain update kc1 set count = 
  (select count(*) from movie_keyword 
   where kc1.keyword_id = movie_keyword.keyword_id) \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: kc1
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 98867
        Extra: Using temporary
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: movie_keyword
         type: ref
possible_keys: k_m
          key: k_m
      key_len: 4
          ref: imdb.kc1.keyword_id
         rows: 17
        Extra: Using index

Having an index on keyword_id is important. In my case, I had a compound index, but a single-column index would help too.

CREATE TABLE `movie_keyword` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `movie_id` int(11) NOT NULL,
  `keyword_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `k_m` (`keyword_id`,`movie_id`)
);

The difference between COUNT(*) and COUNT(movie_id) should be immaterial, assuming movie_id is not nullable. But I use COUNT(*) because it'll still count as an index-only query if my index is defined only on the keyword_id column.
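The NOT NULL assumption matters because COUNT(column) skips NULL values while COUNT(*) counts rows. A minimal sketch of the difference, again assuming Python's built-in sqlite3 module as a stand-in for MySQL; the table t and its rows are made up, with movie_id deliberately left nullable:

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table t and its rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- movie_id is nullable here on purpose, unlike the real movie_keyword table
    CREATE TABLE t (keyword_id INTEGER NOT NULL, movie_id INTEGER);
    INSERT INTO t VALUES (1, 100), (1, NULL), (1, 102);
""")

count_star, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(movie_id) FROM t WHERE keyword_id = 1"
).fetchone()
print(count_star, count_col)  # COUNT(*) sees 3 rows, COUNT(movie_id) only 2
```

With a NOT NULL column the two results are always equal, so the choice comes down to which columns the index has to cover.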

