SQL query to count rows grouped by an ID, but limit count on each group

So I have a bit of an unusual request. I'm working with a table with billions of rows.

The table has a column 'id', which is not unique, and a column 'data'.

What I want to do is count the rows grouped by 'id', but stop counting at 150 entries. I only need to know whether any given id has at least 150 rows.

This is an effort to optimize the query's performance.

It doesn't have to be a count. I only need to know whether a given id has 150 entries, without having MySQL continue counting entries past that point during the query. If that makes sense.

I know how to count, I know how to group, and I know how to do both, but the count will come back with a number in the millions, which is wasted processing time, and the query needs to run on hundreds of thousands of ids.

I don't think you can really optimize performance for this.

select id, (count(*) >= 150)
from t
group by id;
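
As a quick sanity check of what this query returns, here is a minimal sketch that runs the same statement against an in-memory SQLite database through Python's `sqlite3` module. SQLite stands in for MySQL here, and the helper names, the index name `t_id` and the sample row counts are invented for the illustration. It only shows the semantics of the flag; it says nothing about performance on billions of rows.

```python
import sqlite3


def make_demo_db():
    """Build a tiny stand-in for table t (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER NOT NULL, data TEXT)")
    conn.execute("CREATE INDEX t_id ON t (id)")
    counts = {1: 149, 2: 150, 3: 400}  # id -> number of rows to insert
    conn.executemany(
        "INSERT INTO t (id, data) VALUES (?, ?)",
        [(id_, f"row {n}") for id_, total in counts.items() for n in range(total)],
    )
    return conn


def flag_ids_by_group(conn, threshold=150):
    """Return {id: True/False}; every row of every group is still counted."""
    rows = conn.execute(
        "SELECT id, COUNT(*) >= ? FROM t GROUP BY id", (threshold,)
    ).fetchall()
    return {id_: bool(flag) for id_, flag in rows}


conn = make_demo_db()
group_flags = flag_ids_by_group(conn)
print(group_flags)
```

The id with 149 rows is flagged false and the other two true, but the database still has to visit all 400 rows of the largest group, which is exactly the cost the question wants to avoid.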

If you happen to have a separate table with one row per id and an index on t(id), then this might be faster:

select ids.id,
       ((select count(*)
         from t
         where t.id = ids.id
        ) >= 150
       )
from ids;

Unfortunately, MySQL does not support double nesting for correlated subqueries (before MySQL 8.0.14, a derived table cannot reference columns of the outer query), so this is not possible:

select ids.id,
       ((select count(*)
         from (select 1
               from t
               where t.id = ids.id
               limit 150
              ) t
        ) >= 150
       )
from ids;

If it were possible, this might be faster.

EDIT:

If you have an index on id and only want ids that have 150 or more rows, then variables might be faster:

select id,
       (@rn := if(@id = id, @rn + 1,
                  if(@id := id, 1, 1)
                 )
       ) as rn
from (select id
      from t
      order by id
     ) t cross join
     (select @id := 0, @rn := 0) params
having rn = 150;

The thinking here is that using the index to order the table, materializing the result, and scanning it again is probably faster than group by. I don't think row_number() would have the same performance characteristics.
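
For comparison, this is roughly what the row_number() formulation mentioned above would look like. The sketch below is an assumption-laden illustration, not a benchmark: it runs on SQLite (3.25 or later, for window functions) through Python's `sqlite3` module rather than on MySQL, and the helper name and sample data are made up. It only confirms that the window-function query picks out the same ids as the variable-based one is meant to.

```python
import sqlite3


def ids_with_at_least(conn, threshold=150):
    """Return ids that have a row numbered `threshold` within their partition."""
    rows = conn.execute(
        """
        SELECT id
        FROM (SELECT id,
                     ROW_NUMBER() OVER (PARTITION BY id) AS rn
              FROM t) AS numbered
        WHERE rn = ?
        """,
        (threshold,),
    ).fetchall()
    return sorted(id_ for (id_,) in rows)


# Hypothetical sample data: id -> number of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER NOT NULL, data TEXT)")
conn.execute("CREATE INDEX t_id ON t (id)")
for id_, total in {1: 149, 2: 150, 3: 400}.items():
    conn.executemany(
        "INSERT INTO t (id, data) VALUES (?, ?)",
        [(id_, f"row {n}") for n in range(total)],
    )

big_ids = ids_with_at_least(conn)
print(big_ids)
```

Each id appears at most once in the result because only one row per partition can carry the number 150. Whether MySQL executes this faster or slower than the variable version would have to be measured on the real table.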

EDIT II:

A slight variation on the above can be used to get all ids with a flag:

select id, (max(rn) = 150)
from (select id,
             (@rn := if(@id = id, @rn + 1,
                        if(@id := id, 1, 1)
                       )
             ) as rn
      from (select id
            from t
            order by id
           ) t cross join
           (select @id := 0, @rn := 0) params
      having rn in (1, 150)
     ) t
group by id;

EDIT III:

Ahh, if you have a separate table of ids, then this might be the best approach:

select ids.id,
       (select id
        from t
        where t.id = ids.id
        limit 1 offset 149
       ) is not null
from ids;

This will fetch the 150th row from the index. If it is not there, then no row is returned, the subquery evaluates to NULL, and the flag is false.
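
The offset trick can be tried out with a small sketch like the following. It again uses an in-memory SQLite database via Python's `sqlite3` module as a stand-in for MySQL; the `ids` table contents, the helper name and the row counts are assumptions made for the demo. Id 4 deliberately has no rows in t, to show that a missing 150th row yields a false flag.

```python
import sqlite3


def flag_ids_by_offset(conn, threshold=150):
    """Return {id: True/False} by probing for the threshold-th row of each id."""
    rows = conn.execute(
        """
        SELECT ids.id,
               (SELECT t.id
                FROM t
                WHERE t.id = ids.id
                LIMIT 1 OFFSET ?
               ) IS NOT NULL
        FROM ids
        """,
        (threshold - 1,),  # OFFSET 149 lands on the 150th row
    ).fetchall()
    return {id_: bool(flag) for id_, flag in rows}


# Hypothetical sample data: id -> number of rows in t.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ids (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE t (id INTEGER NOT NULL, data TEXT)")
conn.execute("CREATE INDEX t_id ON t (id)")
for id_, total in {1: 149, 2: 150, 3: 400, 4: 0}.items():
    conn.execute("INSERT INTO ids (id) VALUES (?)", (id_,))
    conn.executemany(
        "INSERT INTO t (id, data) VALUES (?, ?)",
        [(id_, f"row {n}") for n in range(total)],
    )

offset_flags = flag_ids_by_offset(conn)
print(offset_flags)
```

The appeal of this form is that each probe can stop after 150 index entries for its id, however many rows that id actually has.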

I don't think that this is possible. You will have to scan the entire table to know which ids have at least 150 entries.

So:

select id
from mytable
group by id
having count(*) >= 150

With an index on id, this should be as efficient as it can be.
