Select rows with unique values for one specific column in large table
table1 has 3 columns in my database: id, timestamp, and cluster, and it has about 1M rows. I want to query the newest 24 rows with unique cluster values (no cluster value may be repeated within the returned 24 rows). The usual solution would be:
SELECT
*
FROM table1
GROUP BY cluster
ORDER BY timestamp DESC
LIMIT 24
However, since I have 1M rows, this query takes a long time to execute. So my solution was to run:
WITH x AS
(
SELECT
*
FROM `table1`
ORDER BY timestamp DESC
LIMIT 50
)
SELECT
*
FROM x
GROUP BY x.cluster
ORDER BY x.timestamp DESC
LIMIT 24
This assumes we can find 24 rows with unique cluster values in every 50 rows. This query runs much faster (~0.007 sec). Is there a more efficient or more routine way to handle such a case?
Your assumption that you will find 24 different clusters in the last 50 rows may not be correct.
Try the ROW_NUMBER() window function:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY timestamp DESC) rn
FROM table1
) t
WHERE rn = 1
ORDER BY timestamp DESC LIMIT 24
You can use row_number(), but you need the right indexes:
select t.*
from (select t.*,
             row_number() over (partition by cluster order by timestamp desc) as seqnum
      from table1 t
     ) t
where seqnum = 1
order by timestamp desc
limit 24;
The index you want is on (cluster, timestamp desc).
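As a rough sketch of that index, assuming MySQL 8.0 or later (the question does not name the engine, although the backtick quoting points that way) and the table name table1 from the question. The index name idx_cluster_timestamp is hypothetical:

```sql
-- Hypothetical index name. Descending index parts take effect in MySQL 8.0+;
-- earlier MySQL versions parse DESC here but ignore it.
CREATE INDEX idx_cluster_timestamp ON table1 (cluster, timestamp DESC);
```

Running EXPLAIN on the row_number() query before and after creating the index is a reasonable way to check whether the optimizer actually uses it.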
For your purposes, this may still not be sufficient because it still processes all the rows, even with an index, when you only need a couple of dozen.
I don't know how many recent rows you need to be sure that you have 24 clusters. However, you might find that this works better if we assume that the most recent 1000 rows have at least 24 clusters:
select t.*
from (select t.*,
             row_number() over (partition by cluster order by timestamp desc) as seqnum
      from (select t.*
            from table1 t
            order by timestamp desc
            limit 1000
           ) t
     ) t
where seqnum = 1
order by timestamp desc
limit 24;
For this, you want an index only on (timestamp desc).
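A minimal sketch of that single-column index, again assuming MySQL 8.0 or later and the table name table1 from the question. The index name idx_timestamp is hypothetical:

```sql
-- Hypothetical index name. Lets the inner "order by timestamp desc limit 1000"
-- read the newest rows straight from the index instead of sorting the table.
CREATE INDEX idx_timestamp ON table1 (timestamp DESC);
```

A plain ascending index on timestamp can usually serve the same purpose, since MySQL is able to scan an index backward.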
Note: You might find that a where clause on the timestamp works better in this case:
where timestamp > now() - interval 24 hour
for instance, to only consider rows in the past 24 hours.
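Put together, the time-window variant might look like the sketch below. It assumes MySQL 8.0 or later, the table name table1 from the question, and that a 24-hour window is an acceptable cutoff, which only the asker can confirm:

```sql
-- Sketch: restrict the scan to recent rows, then keep the newest row per cluster.
-- The 24-hour window is an assumption, not something stated in the question.
SELECT t.*
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY cluster ORDER BY timestamp DESC) AS seqnum
      FROM table1 t
      WHERE timestamp > NOW() - INTERVAL 24 HOUR
     ) t
WHERE seqnum = 1
ORDER BY timestamp DESC
LIMIT 24;
```

If fewer than 24 distinct clusters wrote a row inside the window, this returns fewer than 24 rows, so the window has to be chosen with the data's write rate in mind.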
Since you want "one specific cluster value", this will be fast:
SELECT
*
FROM table1
WHERE cluster = ?
ORDER BY timestamp DESC
LIMIT 24
And have
INDEX(cluster, timestamp)
If that is not what you want, please reword the title and the Question.