Select rows with a unique value for one specific column in a large table

Question

table1 has 3 columns in my database: id, timestamp, cluster, and it has about 1M rows. I want to query the newest 24 rows with unique cluster values (no cluster value may be repeated among the 24 returned rows). The usual solution would be:

SELECT
    *
FROM table1
GROUP BY cluster
ORDER BY timestamp DESC
LIMIT 24

However, since I have 1M rows, this query takes a long time to execute. So my solution was to run:

WITH x AS
(
    SELECT
        *
    FROM `table1`
    ORDER BY timestamp DESC
    LIMIT 50
)
SELECT
    *
FROM x
GROUP BY x.cluster
ORDER BY x.timestamp DESC
LIMIT 24

This assumes that the newest 50 rows always contain 24 rows with unique cluster values. This query runs much faster (~0.007 sec). Is there a more efficient or more standard way to handle this case?

Answer 1

Your assumption that the last 50 rows contain 24 different clusters may not be correct.

Try the ROW_NUMBER() window function:

SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY timestamp DESC) rn
  FROM table1
) t
WHERE rn = 1
ORDER BY timestamp DESC LIMIT 24
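
To make the point about the 50-row assumption concrete, here is a minimal sketch on toy data. It is an illustration only: an in-memory SQLite database (window functions need SQLite 3.25 or newer) stands in for the MySQL table, and the row counts and cluster layout are made up so that the newest 50 rows all share one cluster.

```python
import sqlite3

# Hypothetical toy data: in-memory SQLite stands in for the MySQL table1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, timestamp INTEGER, cluster INTEGER)"
)

# 30 older rows, one per cluster (clusters 0..29, timestamps 0..29) ...
rows = [(ts, ts) for ts in range(30)]
# ... followed by 60 recent rows that all belong to cluster 0.
rows += [(ts, 0) for ts in range(30, 90)]
conn.executemany("INSERT INTO table1 (timestamp, cluster) VALUES (?, ?)", rows)

# Shortcut from the question: only the 50 newest rows are examined.
shortcut = conn.execute("""
    SELECT COUNT(DISTINCT cluster)
    FROM (SELECT cluster FROM table1 ORDER BY timestamp DESC LIMIT 50) AS x
""").fetchone()[0]

# Window-function query: newest row per cluster, then the 24 newest of those.
full = conn.execute("""
    SELECT cluster, timestamp
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY timestamp DESC) AS rn
        FROM table1
    ) AS t
    WHERE rn = 1
    ORDER BY timestamp DESC
    LIMIT 24
""").fetchall()

print(shortcut)   # 1  (the shortcut sees a single cluster)
print(len(full))  # 24 (one row per cluster, newest first)
```

On this data the shortcut could return at most one row, while the ROW_NUMBER() query still returns 24 rows with distinct clusters.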

Answer 2

You can use row_number(), but you need the right indexes:

select t.*
from (select t.*,
             row_number() over (partition by cluster order by timestamp desc) as seqnum
      from t
     ) t
where seqnum = 1
order by timestamp desc
limit 24;

The index you want is on (cluster, timestamp desc).

For your purposes, this may still not be sufficient, because even with an index it processes all the rows when you only need a couple of dozen.

I don't know how many recent rows you need to be sure that you have 24 clusters. However, you might find that this works better if we assume that the most recent 1000 rows have at least 24 clusters:

select t.*
from (select t.*,
             row_number() over (partition by cluster order by timestamp desc) as seqnum
      from (select t.*
            from t
            order by timestamp desc
            limit 1000
           ) t
     ) t
where seqnum = 1
order by timestamp desc
limit 24;

For this, you want an index only on (timestamp desc).

Note: You might find that a where clause on the timestamp works better in this case:
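
As a rough sketch of how the bounded-window variant behaves, the following wraps it in a helper with the window size as a parameter. Everything here is an assumption for illustration: the helper name newest_per_cluster, the index name idx_t_timestamp, the toy data, and the use of in-memory SQLite (3.25+) in place of MySQL.

```python
import sqlite3

def newest_per_cluster(conn, window=1000, limit=24):
    """Newest row per cluster, looking only at the `window` most recent rows."""
    sql = """
        SELECT id, timestamp, cluster
        FROM (
            SELECT r.*,
                   ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY timestamp DESC) AS seqnum
            FROM (SELECT * FROM t ORDER BY timestamp DESC LIMIT ?) AS r
        ) AS s
        WHERE seqnum = 1
        ORDER BY timestamp DESC
        LIMIT ?
    """
    return conn.execute(sql, (window, limit)).fetchall()

# Hypothetical toy table: 5000 rows cycling through 40 clusters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, timestamp INTEGER, cluster INTEGER)")
conn.execute("CREATE INDEX idx_t_timestamp ON t (timestamp DESC)")
conn.executemany(
    "INSERT INTO t (timestamp, cluster) VALUES (?, ?)",
    [(i, i % 40) for i in range(5000)],
)

wide = newest_per_cluster(conn)               # window of 1000 rows
narrow = newest_per_cluster(conn, window=10)  # window too small for 24 clusters
print(len(wide), len(narrow))  # 24 10
```

The second call shows the failure mode: when the window holds fewer than 24 distinct clusters, the query silently returns fewer rows, so the caller may want to check the row count and retry with a larger window.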

where timestamp > now() - interval 24 hour

for instance, to only consider rows from the past 24 hours.
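
A time-based cutoff can be sketched the same way. In this hypothetical version the cutoff is computed in application code and passed as a parameter rather than using now() in SQL. The helper name newest_per_cluster_since, the text timestamp format, and the toy data are assumptions, and SQLite (3.25+) again stands in for MySQL.

```python
import sqlite3
from datetime import datetime, timedelta

def newest_per_cluster_since(conn, cutoff, limit=24):
    """Newest row per cluster among rows with timestamp later than `cutoff`."""
    sql = """
        SELECT id, timestamp, cluster
        FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY timestamp DESC) AS seqnum
            FROM t
            WHERE timestamp > ?
        ) AS s
        WHERE seqnum = 1
        ORDER BY timestamp DESC
        LIMIT ?
    """
    return conn.execute(sql, (cutoff, limit)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, timestamp TEXT, cluster INTEGER)")

# Fixed "now" so the example is reproducible; one row per hour for 48 hours,
# cycling through 6 clusters. Timestamps are stored as 'YYYY-MM-DD HH:MM:SS'.
now = datetime(2021, 5, 29, 12, 0, 0)
rows = [((now - timedelta(hours=h)).isoformat(sep=" "), h % 6) for h in range(48)]
conn.executemany("INSERT INTO t (timestamp, cluster) VALUES (?, ?)", rows)

cutoff = (now - timedelta(hours=24)).isoformat(sep=" ")
recent = newest_per_cluster_since(conn, cutoff)
print(len(recent))  # 6 (only 6 clusters exist in this toy data)
```

As with the row-count window, this only returns 24 rows if at least 24 clusters were active inside the chosen time range.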

Answer 3

Since you want "one specific cluster value", this will be fast:

SELECT
    *
FROM table1
WHERE cluster = ?
ORDER BY timestamp DESC
LIMIT 24

And have

INDEX(cluster, timestamp)

If that is not what you want, please reword the title and the Question.
