
Optimizing MySQL query where the number of returned rows is very large

Context: We have a website where users (merchants) can add their apps/websites into the system and pay their users via API. Now, the problem comes when we have to show the list of those transactions to the merchant on their dashboard. Each merchant generates hundreds of transactions per second, and on average a merchant has around 2 million transactions per day; on the dashboard, we have to show today's stats to the merchant.

Main Problem: We have to show today's transactions to the merchant, which is around 2 million records for a single merchant. So, a query like this:

SELECT * FROM transactions WHERE user_id = 123 LIMIT 0,15

Rows examined are 2 million in our example, and that cannot be reduced in any way. The LIMIT doesn't help here, I think, because MySQL will still examine all rows and then pick the first 15 from the result set.

How can we optimize queries like this, where we have to show millions of records (with pagination, of course) to the user?

Edit:

Explain output:

(EXPLAIN output was posted as an image.)

Query:

    explain select a.id, a.user_app_id, a.created_at, a.type, a.amount, a.currency_id,
            b.name, b.url
        from transactions as a
        left join user_apps as b on a.user_app_id = b.id
        where a.sender_user_id = ?
          and a.created_at BETWEEN '2020-03-20' AND '2020-03-21'
        order by a.created_at desc
        limit 15 offset 0

Details:

Index sender_user_id_2 is a composite index on the sender_user_id (int) and created_at (timestamp) columns.

This query is taking 5 to 15 seconds to return 15 rows.

If I run the same query for a sender_user_id which has only 24 transactions in the table, then the response is instant.

First, let's fix what might be a bug: You are including two midnights in that "day". BETWEEN is "inclusive".

 AND  a.created_at BETWEEN '2020-03-20' AND '2020-03-21'

-->

 AND  a.created_at >= '2020-03-20'
 AND  a.created_at  < '2020-03-20' + INTERVAL 1 DAY

(There is no performance change, just the elimination of tomorrow's midnight.)

In your simple query, only 15 rows will be touched due to the LIMIT. However, for more complex queries it may need to gather all rows, sort them, and only then peel off 15 rows. The technique for preventing that inefficiency goes something like this: devise, if possible, an INDEX that handles all of the WHERE and the ORDER BY.

    where  a.sender_user_id = ?
      AND  a.created_at >= '2020-03-20'
      AND  a.created_at  < '2020-03-20' + INTERVAL 1 DAY
    order by  a.created_at desc

needs INDEX(sender_user_id, created_at) -- in that order. (And, in your query, nothing else encroaches on that.)
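For reference, creating such an index would look like the following sketch (the name sender_created is arbitrary; your existing sender_user_id_2 may already be exactly this):

    ALTER TABLE transactions
        ADD INDEX sender_created (sender_user_id, created_at);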

Pagination via OFFSET introduces another performance problem -- it must step over all OFFSET rows before getting the ones you want. This is solvable by remembering where you left off.
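A sketch of that technique, sometimes called keyset pagination, assuming the dashboard passes back the created_at of the last row it displayed (ties on created_at would also need a unique tiebreaker column such as id):

    SELECT  id, user_app_id, created_at, type, amount, currency_id
        FROM  transactions
        WHERE  sender_user_id = ?
          AND  created_at < ?      -- created_at of the last row on the previous page
        ORDER BY  created_at DESC
        LIMIT  15;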

So, why does EXPLAIN think it will hit a million rows? Because EXPLAIN is dumb when it comes to handling LIMIT. There is a better way to estimate the effort: the Handler status counters. If all is working well, they will show 15 rows touched, not a million. For LIMIT 150, 15, they will show 165.
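A minimal recipe for reading those counters (substitute your actual query for the placeholder SELECT; the counters are per-session):

    FLUSH STATUS;                            -- reset this session's counters
    SELECT ...;                              -- run the query under test
    SHOW SESSION STATUS LIKE 'Handler%';     -- the Handler_read_* totals approximate rows actually touched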

You said "Index sender_user_id_2 is a composite index of sender_user_id (int) and created_at (timestamp) columns." Can you provide SHOW CREATE TABLE so we can check for something else subtle going on?
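That is, the output of something like:

    SHOW CREATE TABLE transactions;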

Hmmm... I wonder if

order by  a.created_at desc

should be changed to match the index:

order by a.sender_user_id DESC, a.created_at desc

(What version of MySQL are you using? I did some experimenting and found no difference from having (or not having) sender_user_id in the ORDER BY.)

(Trouble -- it seems that the JOIN prevents the effective use of LIMIT. Still digging...)

New suggestion:

select  a.id, a.user_app_id, a.created_at, a.type, a.amount, a.currency_id,
        b.name, b.url
    from  
    (
        -- Derived table: find just the 15 ids, using only the composite index
        SELECT  a1.id
            FROM  transactions as a1
            where  a1.sender_user_id = ?
              AND  a1.created_at >= '2020-03-20'
              AND  a1.created_at  < '2020-03-20' + INTERVAL 1 DAY
            order by  a1.created_at desc
            limit  15 offset 0 
    ) AS x
    JOIN  transactions AS a  USING(id)        -- then fetch the full rows for those 15 ids
    left join  user_apps as b  ON a.user_app_id = b.id 

This uses a generic 'trick' to move the LIMIT into a derived table, with minimal other stuff. Then, with only 15 ids, the JOINs to other tables go 'fast'.

In my experiment (with a different pair of tables), it touched only 5*15 rows. I checked multiple versions; all seem to need this technique. I used the Handler_read counts to verify the results.

When I tried with a JOIN but not a derived table, it was touching 2*N rows, where N was the number of rows without the LIMIT.
