
Optimizing MySQL query where the number of returned rows is very large

Context: We have a website where users (merchants) can add their apps/websites into the system and pay their users via API. The problem comes when we have to show the list of those transactions to the merchant on their dashboard. Each merchant generates hundreds of transactions per second, which works out to around 2 million transactions per day on average, and on the dashboard we have to show today's stats to the merchant.

Main Problem: We have to show today's transactions to the merchant, which is around 2 million records for a single merchant. So we have a query like this:

SELECT * FROM transactions WHERE user_id = 123 LIMIT 0,15

Rows examined are 2 million in our example, and that cannot be reduced in any way. The LIMIT doesn't help here, I think, because MySQL will still examine all rows and then pick the first 15 from the result set.

How can we optimize queries like this, where we have to show millions of records (with pagination, of course) to the user?

Edit:

Explain output:

(screenshot of the EXPLAIN output omitted)

Query:

    explain select a.id, a.user_app_id, a.created_at, a.type, a.amount, a.currency_id,
            b.name, b.url
        from transactions as a
        left join user_apps as b on a.user_app_id = b.id
        where a.sender_user_id = ?
          and a.created_at BETWEEN '2020-03-20' AND '2020-03-21'
        order by a.created_at desc
        limit 15 offset 0

Details:

Index sender_user_id_2 is a composite index on the sender_user_id (int) and created_at (timestamp) columns.

This query is taking 5 to 15 seconds to return 15 rows.

If I run the same query for a sender_user_id that has only 24 transactions in the table, then the response is instant.

First, let's fix what might be a bug: You are including two midnights in that "day". BETWEEN is "inclusive".

 AND  a.created_at BETWEEN '2020-03-20' AND '2020-03-21'

-->

 AND  a.created_at >= '2020-03-20'
 AND  a.created_at  < '2020-03-20' + INTERVAL 1 DAY

(There is no performance change, just the elimination of tomorrow's midnight.)
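A quick way to see the problem in isolation (a standalone check, not tied to your table; TIMESTAMP() just forces a datetime comparison instead of a string comparison):

    -- 1: tomorrow's midnight IS caught by the inclusive BETWEEN
    SELECT  TIMESTAMP('2020-03-21 00:00:00')
                BETWEEN '2020-03-20' AND '2020-03-21'  AS caught_by_between;

    -- 0: the half-open range correctly excludes it
    SELECT  TIMESTAMP('2020-03-21 00:00:00') >= '2020-03-20'
       AND  TIMESTAMP('2020-03-21 00:00:00') <  '2020-03-20' + INTERVAL 1 DAY
                AS caught_by_half_open;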

In your simple query, only 15 rows will be touched due to the LIMIT. However, for more complex queries it may need to gather all rows, sort them, and only then peel off 15 rows. The technique for preventing that inefficiency goes something like this: Devise, if possible, an INDEX that handles all of the WHERE and the ORDER BY.

    where  a.sender_user_id = ?
      AND  a.created_at >= '2020-03-20'
      AND  a.created_at  < '2020-03-20' + INTERVAL 1 DAY
    order by  a.created_at desc

needs INDEX(sender_user_id, created_at) -- in that order. (And, in your query, nothing else encroaches on that.)
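(If you were building that index from scratch, it would look like the following; the index name here is only illustrative, since your sender_user_id_2 apparently already matches this definition.)

    ALTER TABLE transactions
        ADD INDEX sender_created (sender_user_id, created_at);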

Pagination via OFFSET introduces another performance problem -- it must step over all OFFSET rows before getting the ones you want. This is solvable by remembering where you left off.
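Here is a sketch of that "remember where you left off" approach, adapted to your columns. Since created_at can repeat within a second, id is carried along as a tiebreaker; the two placeholders in the second query are the created_at and id of the last row from the previous page:

    -- First page
    SELECT  id, user_app_id, created_at, type, amount, currency_id
        FROM  transactions
        WHERE  sender_user_id = ?
          AND  created_at >= '2020-03-20'
          AND  created_at  < '2020-03-20' + INTERVAL 1 DAY
        ORDER BY  created_at DESC, id DESC
        LIMIT  15;

    -- Subsequent pages: resume strictly after the last row shown
    SELECT  id, user_app_id, created_at, type, amount, currency_id
        FROM  transactions
        WHERE  sender_user_id = ?
          AND  created_at >= '2020-03-20'
          AND  created_at  < '2020-03-20' + INTERVAL 1 DAY
          AND  (created_at, id) < (?, ?)
        ORDER BY  created_at DESC, id DESC
        LIMIT  15;

(Depending on version, MySQL may not use the index optimally for the row constructor; the more conservative form is created_at < ? OR (created_at = ? AND id < ?).)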

So, why does EXPLAIN think it will hit a million rows? Because EXPLAIN is dumb when it comes to handling LIMIT. There is a better way to estimate the effort: the Handler status counters. They will show 15, not a million, if all is working well. For LIMIT 150, 15, they will show 165.
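Concretely, the measurement looks something like this:

    FLUSH STATUS;                 -- zero the session counters

    -- ... run the SELECT under test ...

    SHOW SESSION STATUS LIKE 'Handler%';
    -- Sum the Handler_read_* values: roughly 15 means the LIMIT
    -- stopped early; millions means a full scan/sort happened first.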

You said "Index sender_user_id_2 is an composite index of sender_user_id(int) and created_at(timestamp) column." Can you provide SHOW CREATE TABLE so we can check for something else subtle going on?

Hmmm... I wonder if

order by  a.created_at desc

should be changed to match the index:

order by a.sender_user_id DESC, a.created_at desc

(What version of MySQL are you using? I did some experimenting and found no difference because of having (or not) sender_user_id in the ORDER BY.)

(Trouble -- It seems that the JOIN prevents the effective use of LIMIT. Still digging...)

New suggestion:

select  a.id, a.user_app_id, a.created_at, a.type, a.amount, a.currency_id,
        b.name, b.url
    from  
    (
        SELECT  a1.id
            FROM  transactions as a1
            where  a1.sender_user_id = ?
              AND  a1.created_at >= '2020-03-20'
              AND  a1.created_at  < '2020-03-20' + INTERVAL 1 DAY
            order by  a1.created_at desc
            limit  15 offset 0 
    ) AS x
    JOIN  transactions AS a USING(id)
    left join  user_apps as b  ON a.user_app_id = b.id 

This uses a generic 'trick' to move the LIMIT into a derived table, with minimal other stuff. Then, with only 15 ids, the JOINs to the other tables go 'fast'.

In my experiment (with a different pair of tables), it touched only 5*15 rows. I checked multiple versions; all seem to need this technique. I used the Handler_read counters to verify the results.

When I tried with a JOIN but not a derived table, it was touching 2*N rows, where N was the number of rows without the LIMIT.
