Ranking in MySQL: how do I get the best performance with frequent updates and a large data set?
I want grouped ranking on a very large table. I've found a couple of solutions for this problem, e.g. in this post and other places on the web. I am, however, unable to figure out the worst-case complexity of these solutions. The specific problem consists of a table where each row has a number of points and an associated name. I want to be able to request rank intervals such as 1-4. Here is some example data:
name | points
Ab 14
Ac 14
B 16
C 16
Da 15
De 13
With these values the following "ranking" is created:
Query id | Rank | Name
1 1 B
2 1 C
3 3 Da
4 4 Ab
5 4 Ac
6 6 De
And it should be possible to request an interval of query ids, for example 2-5, giving ranks 1, 3, 4 and 4.
The database holds about 3 million records, so if possible I want to avoid a solution with complexity greater than log(n). There are constant updates and inserts on the database, so these operations should preferably run in log(n) as well. I am not sure it's possible, though, and I've been trying to wrap my head around it for some time. I've come to the conclusion that a binary search should be possible, but I haven't been able to create a query that does this. I am using a MySQL server.
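To pin down what is being asked for, here is a minimal sketch of the ranking in plain Python rather than SQL. It only reproduces the table above from the sample rows; the function names `ranked` and `rank_interval` are made up for illustration, and it sorts the whole list, so it says nothing about the log(n) goal.

```python
# Sample rows from the question: (name, points).
rows = [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)]


def ranked(rows):
    """Return (query_id, rank, name) tuples, highest points first.

    Rows with equal points share a rank, and the next distinct score
    takes its query id as rank (1, 1, 3, ...), as in the table above.
    """
    # Points descending, then name ascending, to match the example table.
    ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
    out = []
    prev_points = None
    rank = 0
    for query_id, (name, points) in enumerate(ordered, start=1):
        if points != prev_points:
            rank = query_id
            prev_points = points
        out.append((query_id, rank, name))
    return out


def rank_interval(rows, from_id, till_id):
    """Rows whose query id lies in [from_id, till_id], both ends inclusive."""
    return [r for r in ranked(rows) if from_id <= r[0] <= till_id]


print(rank_interval(rows, 2, 5))
# [(2, 1, 'C'), (3, 3, 'Da'), (4, 4, 'Ab'), (5, 4, 'Ac')]
```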
I will elaborate on how the pseudocode for the filtering could work. Firstly, an index on (points, name) is needed. As input you give a fromrank and a tillrank. The total number of records in the database is n. The pseudocode should look something like this:
Find the median point value and count the rows with fewer points than this value (the count gives a rough estimate of rank, not considering rows with the same number of points). If the number returned is greater than the fromrank delimiter, subdivide the first half and find its median. Keep doing this until we have pinpointed the point value where fromrank should start. Then do the same within that point value using the name part of the index, taking medians until we have reached the correct row. Do exactly the same thing for tillrank.
The result should be log(n) subdivisions. So, provided the median and the count can each be computed in log(n) time, it should be possible to solve the problem in worst-case complexity log(n). Correct me if I am wrong.
You need a stored procedure to be able to call this with parameters:
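The subdivision idea can be sketched in Python on a sorted list, which stands in for the (points, name) index. This is only an illustration under a strong assumption: that "count the rows at or above this value" is itself a cheap binary search. That holds for a sorted array via `bisect`, but it is an assumption for a MySQL index, not something the question establishes. The name `score_at_position` is hypothetical.

```python
from bisect import bisect_left

# Points sorted ascending; a stand-in for the index, not a MySQL structure.
points_sorted = sorted([14, 14, 16, 16, 15, 13])


def score_at_position(points_sorted, position):
    """Find the point value held by the row at 1-based `position`
    when rows are ordered by points descending.

    Binary search over the distinct point values: at each step, count
    the rows with at least the candidate score and halve the range.
    """
    # Building this list is O(n); it is done here only to keep the sketch short.
    distinct = sorted(set(points_sorted), reverse=True)
    lo, hi = 0, len(distinct) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Rows with points >= distinct[mid], counted by binary search.
        covered = len(points_sorted) - bisect_left(points_sorted, distinct[mid])
        if covered >= position:
            hi = mid        # position falls at this score or a higher one
        else:
            lo = mid + 1    # position falls at a lower score
    return distinct[lo]


print(score_at_position(points_sorted, 4))  # 14
```

Each halving step here does its own binary-search count, so even in this idealised setting the cost is log(n) steps times the cost of one count; the overall bound depends entirely on how cheap that count really is.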
CREATE TABLE rank (name VARCHAR(20) NOT NULL, points INTEGER NOT NULL);
CREATE INDEX ix_rank_points ON rank(points, name);
CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
SET @fromrank = fromrank;
SET @tillrank = tillrank;
PREPARE STMT FROM
'
SELECT rn, rank, name, points
FROM (
SELECT CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
@rn := @rn + 1 AS rn,
@cp := points,
r.*
FROM (
SELECT @cp := -1, @rn := 0, @rank := 1
) var,
(
SELECT *
FROM rank
FORCE INDEX (ix_rank_points)
ORDER BY
points DESC, name DESC
LIMIT ?
) r
) o
WHERE rn >= ?
';
EXECUTE STMT USING @tillrank, @fromrank;
END;
CALL prc_ranks (2, 5);
If you create the index and force MySQL to use it (as in my query), then the complexity of the query will not depend on the number of rows at all; it will depend only on tillrank.
It will actually take the last tillrank values from the index, perform some simple calculations on them, and filter out the rows that come before fromrank.
The time of this operation, as you can see, depends only on tillrank; it does not depend on how many records there are.
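For readers who find the user-variable query hard to follow, here is a rough Python emulation of what it does with @cp, @rn and @rank. It is a sketch, not the procedure itself: the Python `sorted` call stands in for reading the index in order (in MySQL the index already provides that order, so only the first tillrank entries are touched), and the sample rows are the ones from the question.

```python
rows = [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)]


def prc_ranks(rows, fromrank, tillrank):
    """Emulate the query: take the top `tillrank` rows ordered by
    points DESC, name DESC, number them, keep row numbers >= fromrank."""
    # Stand-in for the index scan with LIMIT tillrank.
    top = sorted(rows, key=lambda r: (r[1], r[0]), reverse=True)[:tillrank]
    out = []
    cp, rn, rank = None, 0, 0      # play the roles of @cp, @rn, @rank
    for name, points in top:
        if points != cp:           # new score: rank jumps to the row number
            rank = rn + 1
        rn += 1
        cp = points
        if rn >= fromrank:         # the outer WHERE rn >= ?
            out.append((rn, rank, name, points))
    return out


print(prc_ranks(rows, 2, 5))
# [(2, 1, 'B', 16), (3, 3, 'Da', 15), (4, 4, 'Ac', 14), (5, 4, 'Ab', 14)]
```

The loop runs tillrank times regardless of how many rows the table holds, which is the point made above.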
I just checked this on 400,000 rows: it selects ranks from 5 to 100 in 0.004 seconds (that is, instantly).
Important: this only works if you sort on names in DESCENDING order. MySQL does not support the DESC clause in indexes (this applies to versions before 8.0, which added descending indexes), which means that points and name must be sorted in the same direction for the index sort to be usable (either both ASCENDING or both DESCENDING). If you want fast ASC sorting by name, you will need to keep negative points in the database and change the sign in the SELECT clause.
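The negative-points trick is easy to check outside the database. A small Python sketch, using the sample rows from the question: storing the negated score lets a single ascending sort on (negated points, name) produce points descending with names ascending, and the sign is flipped back on output.

```python
rows = [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)]

# Keep the negated score, as the stored column would.
stored = [(-points, name) for name, points in rows]

# One ascending sort on both parts, like an ascending (points, name) index.
stored.sort()

# Flip the sign back when reading, as the SELECT clause would.
result = [(name, -neg_points) for neg_points, name in stored]

print(result)
# [('B', 16), ('C', 16), ('Da', 15), ('Ab', 14), ('Ac', 14), ('De', 13)]
```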
You may also remove name from the index altogether and perform the final ORDER BY without using an index:
CREATE INDEX ix_rank_points ON rank(points);
CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
SET @fromrank = fromrank;
SET @tillrank = tillrank;
PREPARE STMT FROM
'
SELECT rn, rank, name, points
FROM (
SELECT CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
@rn := @rn + 1 AS rn,
@cp := points,
r.*
FROM (
SELECT @cp := -1, @rn := 0, @rank := 1
) var,
(
SELECT *
FROM rank
FORCE INDEX (ix_rank_points)
ORDER BY
points DESC
LIMIT ?
) r
) o
WHERE rn >= ?
ORDER BY rank, name
';
EXECUTE STMT USING @tillrank, @fromrank;
END;
That will impact performance on big ranges, but you will hardly notice it on small ranges.