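MySQL SELECT most frequent by group
How do I get the most frequently occurring category for each tag in MySQL? Ideally, I'd like to simulate an aggregate function that computes the mode of a column.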
SELECT
t.tag
, s.category
FROM tags t
LEFT JOIN stuff s
USING (id)
ORDER BY tag;
+------------------+----------+
| tag              | category |
+------------------+----------+
| automotive       |        8 |
| ba               |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| bath             |        9 |
+------------------+----------+
SELECT t1.*
FROM (SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category) t1
LEFT OUTER JOIN
(SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category) t2
ON (t1.tag = t2.tag AND (t1.count < t2.count
OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;
I agree this is a bit much for a single SQL query. Any use of GROUP BY inside a subquery makes me cringe. You can make it look simpler with a view:
CREATE VIEW count_per_category AS
SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category;
SELECT t1.*
FROM count_per_category t1
LEFT OUTER JOIN count_per_category t2
ON (t1.tag = t2.tag AND (t1.count < t2.count
OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;
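But it is basically doing the same work behind the scenes.
You commented that you could easily do a similar operation in application code. So why not do that? Run the simpler query to get the counts per category: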
SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category;
Then sort the result in application code and keep the top category for each tag.
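As a rough illustration of that application-side step, here is a minimal Python sketch. It assumes the rows of the query above have already been fetched as (tag, category, count) tuples. The function name most_frequent_by_tag and the tie-breaking rule (the larger category wins, mirroring the join condition in the query earlier in this answer) are assumptions made for illustration, not part of the original answer.

```python
def most_frequent_by_tag(rows):
    """Return {tag: (category, count)}, keeping the highest count per tag.

    `rows` is an iterable of (tag, category, count) tuples, such as the
    result of the GROUP BY tag, category query. Ties on count go to the
    larger category (an assumption, chosen for a deterministic result).
    """
    best = {}
    for tag, category, count in rows:
        current = best.get(tag)
        # Compare (count, category) pairs so ties fall back to category.
        if current is None or (count, category) > (current[1], current[0]):
            best[tag] = (category, count)
    return best


# Per-(tag, category) counts for the sample data in the question.
rows = [
    ("automotive", 8, 1),
    ("ba", 8, 1),
    ("bamboo", 8, 9),
    ("bamboo", 9, 2),
    ("bamboo", 10, 2),
    ("banana tree", 8, 4),
    ("bath", 9, 1),
]

for tag, (category, count) in sorted(most_frequent_by_tag(rows).items()):
    print(tag, category, count)
```

A single pass with a dictionary avoids sorting the whole result set; sorting by tag and count and taking the first row per tag would work just as well.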
SELECT tag, category
FROM (
SELECT @tag <> tag AS _new,
@tag := tag AS tag,
category, COUNT(*) AS cnt
FROM (
SELECT @tag := ''
) vars,
stuff
GROUP BY
tag, category
ORDER BY
tag, cnt DESC
) q
WHERE _new
On your data, this returns the following:
'automotive', 8
'ba', 8
'bamboo', 8
'bananatree', 8
'bath', 9
Here is the test script:
CREATE TABLE stuff (tag VARCHAR(20) NOT NULL, category INT NOT NULL);
INSERT
INTO stuff
VALUES
('automotive',8),
('ba',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bath',9);
(Edit: I had forgotten the DESC in ORDER BY.)
This is easy with LIMIT in a subquery. Does MySQL still have the no-LIMIT-in-subqueries restriction? The example below uses PostgreSQL.
=> select tag,
          (select category from stuff z where z.tag = s.tag
            group by tag, category order by count(*) DESC limit 1) AS category,
          (select count(*) from stuff z where z.tag = s.tag
            group by tag, category order by count(*) DESC limit 1) AS num_items
     from stuff s
    group by tag;
tag | category | num_items
------------+----------+-----------
ba | 8 | 1
automotive | 8 | 1
bananatree | 8 | 4
bath | 9 | 1
bamboo | 8 | 9
(5 rows)
The third column is only needed if you want the count.
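For readers without a PostgreSQL instance at hand, the same correlated-subquery-with-LIMIT idea can be tried with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: it runs on SQLite rather than PostgreSQL or MySQL, the helper name top_category_per_tag is made up for illustration, and the extra category DESC tie-breaker is an addition to keep the result deterministic.

```python
import sqlite3

# Same sample rows as the test script earlier in this thread.
ROWS = (
    [("automotive", 8), ("ba", 8)]
    + [("bamboo", c) for c in (8, 8, 8, 8, 8, 10, 8, 9, 8, 10, 8, 9, 8)]
    + [("bananatree", 8)] * 4
    + [("bath", 9)]
)


def top_category_per_tag(rows):
    """Return [(tag, most_frequent_category), ...] ordered by tag."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE stuff (tag TEXT NOT NULL, category INTEGER NOT NULL)"
        )
        conn.executemany("INSERT INTO stuff VALUES (?, ?)", rows)
        # For each distinct tag, a scalar subquery picks the category with
        # the highest row count; LIMIT 1 keeps only the top one.
        query = """
            SELECT s.tag,
                   (SELECT z.category
                      FROM stuff z
                     WHERE z.tag = s.tag
                     GROUP BY z.category
                     ORDER BY COUNT(*) DESC, z.category DESC
                     LIMIT 1) AS category
              FROM (SELECT DISTINCT tag FROM stuff) s
             ORDER BY s.tag
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


for tag, category in top_category_per_tag(ROWS):
    print(tag, category)
```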
This is for a simpler case:
SELECT action, COUNT(action) AS ActionCount FROM log GROUP BY action ORDER BY ActionCount DESC;
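Here is a hacky approach that leverages the max aggregate function, since MySQL has no mode aggregate function (and, before version 8.0, no window functions) that would allow this directly: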
SELECT
tag,
convert(substring(max(concat(lpad(c, 20, '0'), category)), 21), signed)
AS most_frequent_category
FROM (
SELECT tag, category, count(*) AS c
FROM tags INNER JOIN stuff using (id)
GROUP BY tag, category
) as grouped_cats
GROUP BY tag;
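Basically, it leverages the fact that once each count is zero-padded to a fixed width and concatenated with its category, the lexical max of those strings per tag belongs to the category with the highest count.
This is easier to see with named categories: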
create temporary table tags (id int auto_increment primary key, tag character varying(20));
create temporary table stuff (id int, category character varying(20));
insert into tags (tag) values ('automotive'), ('ba'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('banana tree'), ('banana tree'), ('banana tree'), ('banana tree'), ('bath');
insert into stuff (id, category) values (1, 'cat-8'), (2, 'cat-8'), (3, 'cat-8'), (4, 'cat-8'), (5, 'cat-8'), (6, 'cat-8'), (7, 'cat-8'), (8, 'cat-10'), (9, 'cat-8'), (10, 'cat-9'), (11, 'cat-8'), (12, 'cat-10'), (13, 'cat-8'), (14, 'cat-9'), (15, 'cat-8'), (16, 'cat-8'), (17, 'cat-8'), (18, 'cat-8'), (19, 'cat-8'), (20, 'cat-9');
In this case, we shouldn't apply the integer conversion to the most_frequent_category column:
SELECT
tag,
substring(max(concat(lpad(c, 20, '0'), category)), 21) AS most_frequent_category
FROM (
SELECT tag, category, count(*) AS c
FROM tags INNER JOIN stuff using (id)
GROUP BY tag, category
) as grouped_cats
GROUP BY tag;
+-------------+------------------------+
| tag         | most_frequent_category |
+-------------+------------------------+
| automotive  | cat-8                  |
| ba          | cat-8                  |
| bamboo      | cat-8                  |
| banana tree | cat-8                  |
| bath        | cat-9                  |
+-------------+------------------------+
To dig a little deeper into what is going on, here is what the grouped_cats inner select looks like (I've added order by tag, c desc):
+-------------+----------+---+
| tag         | category | c |
+-------------+----------+---+
| automotive  | cat-8    | 1 |
| ba          | cat-8    | 1 |
| bamboo      | cat-8    | 9 |
| bamboo      | cat-10   | 2 |
| bamboo      | cat-9    | 2 |
| banana tree | cat-8    | 4 |
| bath        | cat-9    | 1 |
+-------------+----------+---+
If we omit the substring bit, we can see how the max of the padded count(*) column drags its associated category along with it:
SELECT
tag,
max(concat(lpad(c, 20, '0'), category)) AS xmost_frequent_category
FROM (
SELECT tag, category, count(*) AS c
FROM tags INNER JOIN stuff using (id)
GROUP BY tag, category
) as grouped_cats
GROUP BY tag;
+-------------+---------------------------+
| tag         | xmost_frequent_category   |
+-------------+---------------------------+
| automotive  | 00000000000000000001cat-8 |
| ba          | 00000000000000000001cat-8 |
| bamboo      | 00000000000000000009cat-8 |
| banana tree | 00000000000000000004cat-8 |
| bath        | 00000000000000000001cat-9 |
+-------------+---------------------------+
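The padding trick is not specific to SQL, and a small Python sketch may make the mechanics easier to follow. The function name mode_via_lexical_max is hypothetical, and Python compares strings by code point, whereas MySQL's comparison depends on the column collation, so treat this as an analogy rather than an exact reproduction.

```python
def mode_via_lexical_max(counts, width=20):
    """Pick the most frequent category for one tag via a string max.

    `counts` is an iterable of (category, count) pairs for a single tag.
    """
    # Zero-pad each count to a fixed width so string order matches numeric
    # order, then append the category: concat(lpad(c, 20, '0'), category).
    keys = [str(count).zfill(width) + category for category, count in counts]
    best = max(keys)      # lexical maximum, like max() over strings in SQL
    return best[width:]   # like substring(..., 21): drop the padded count


# The bamboo rows from the grouped_cats output above.
bamboo = [("cat-8", 9), ("cat-10", 2), ("cat-9", 2)]
print(mode_via_lexical_max(bamboo))
```

Note that when two categories share the same count, the lexically larger category string wins, because the category is part of the compared string.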