
MySQL: long running LEFT JOIN query performance

A MySQL database contains two tables: customer and customer_orders.

The customer table has 80 million rows and 80 columns. The ones I am interested in are:

  1. Id (PK, int(10))
  2. Location (varchar 255, nullable).
  3. Registration_Date (DateTime, nullable). Indexed.

The customer_orders table has 40 million rows and only 3 columns:

  1. Id (PK, int(10))
  2. Customer_Id (int(10), FK to customer table)
  3. Order_Date (DateTime, nullable)

When I run the following query, it takes ~800 seconds to execute and returns 40 million rows:

SELECT o.* 
FROM customer_orders o
LEFT JOIN customer c ON (c.Id = o.Customer_Id) 
WHERE NOT (ISNULL(c.Location)) AND c.Registration_Date < '2018-01-01 00:00:00';

The machine running the MySQL server has 32 GB of RAM, 28 GB of which is assigned to MySQL. MySQL version: 5.6.39.

Is it normal for MySQL to take this long to execute such a query on tables of this size? How can I improve the performance?

Update:

The customer_orders table does not contain any vital data that we need to keep. It is a kind of copy table holding the orders made within the last 10 days. Every day we run a stored procedure that deletes orders older than 10 days within a single transaction.

At some point this stored procedure started ending with a timeout because of an unoptimized query, and the number of orders kept growing every day. The previous query also contained a COUNT, which, I suppose, is what exceeded the timeout.

Nevertheless, it surprised me that it can take up to 15 minutes for MySQL to fetch 40 million records with these additional conditions.

I think it's normal. It would be helpful if you shared what EXPLAIN returns for that query.

To optimize the query, it might not be a good idea to start with customer_orders, as you are not filtering it in any way (so it performs a full table scan over 40M records). Also, as pointed out in the comments, a LEFT JOIN is not needed here. I would write your query like this:

SELECT o.*
FROM customer c, customer_orders o
WHERE c.id = o.Customer_Id
AND   c.Location IS NOT NULL
AND   c.Registration_Date < '2018-01-01'

This will (depending on how many records satisfy the condition Registration_Date < '2018-01-01') filter the customer table first and then join with the customer_orders table, which has an index on Customer_Id.

Also, maybe unrelated, but is it expected that the query returns 40M records? That is essentially the whole customer_orders table. If I am right, it means that all orders come from customers registered before '2018-01-01'.
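Whether MySQL really reads customer first, and whether the index on Customer_Id exists, can be checked rather than assumed. A minimal sketch, using the table and column names from the question; STRAIGHT_JOIN is only there to force the join order for the comparison, not a recommendation to keep it:

```sql
-- List the indexes on customer_orders; the plan below is only cheap
-- if one of them starts with Customer_Id.
SHOW INDEX FROM customer_orders;

-- STRAIGHT_JOIN makes MySQL read the left table (customer) first.
-- EXPLAIN shows the chosen plan without returning the result rows.
EXPLAIN
SELECT o.*
FROM customer c
STRAIGHT_JOIN customer_orders o ON o.Customer_Id = c.Id
WHERE c.Location IS NOT NULL
  AND c.Registration_Date < '2018-01-01 00:00:00';
```

If most customers registered before 2018, the date filter removes very little, and this order may well be slower than scanning customer_orders and looking up each customer by primary key.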

This is too long for a comment...

The first thing to note about your query is that it does not actually behave as a LEFT JOIN: the conditions in the WHERE clause refer to the LEFT JOINed table, so any order without a matching customer (where the c columns are NULL) is filtered out anyway.

It could be rewritten as:

SELECT o.* 
FROM customer_orders o
INNER JOIN customer c 
    ON c.Id = o.Customer_Id
    AND c.Location is NOT NULL
    AND c.Registration_Date < '2018-01-01 00:00:00';

Being explicit about the join type is better for readability and may help MySQL to find a better execution path for the query.
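The equivalence of the two forms can be checked on a handful of rows. The sketch below uses hypothetical scratch tables (demo_customer, demo_orders) with made-up data, not the real schema:

```sql
-- Hypothetical scratch tables, for illustration only.
CREATE TEMPORARY TABLE demo_customer (
    Id INT PRIMARY KEY,
    Location VARCHAR(255) NULL,
    Registration_Date DATETIME NULL
);
CREATE TEMPORARY TABLE demo_orders (
    Id INT PRIMARY KEY,
    Customer_Id INT NULL,
    Order_Date DATETIME NULL
);

INSERT INTO demo_customer VALUES
    (1, 'Berlin', '2017-05-01 00:00:00'),  -- passes both conditions
    (2, NULL,     '2017-06-01 00:00:00'),  -- fails: Location is NULL
    (3, 'Paris',  '2018-03-01 00:00:00');  -- fails: registered too late
INSERT INTO demo_orders VALUES
    (10, 1,  '2018-04-01 00:00:00'),
    (11, 2,  '2018-04-02 00:00:00'),
    (12, 3,  '2018-04-03 00:00:00'),
    (13, 99, '2018-04-04 00:00:00');       -- no matching customer

-- Original shape: LEFT JOIN plus WHERE conditions on the joined table.
SELECT o.Id
FROM demo_orders o
LEFT JOIN demo_customer c ON c.Id = o.Customer_Id
WHERE c.Location IS NOT NULL
  AND c.Registration_Date < '2018-01-01 00:00:00';

-- Rewritten shape: INNER JOIN with the conditions in the ON clause.
SELECT o.Id
FROM demo_orders o
INNER JOIN demo_customer c
    ON c.Id = o.Customer_Id
    AND c.Location IS NOT NULL
    AND c.Registration_Date < '2018-01-01 00:00:00';
```

Both statements return only order 10. Order 13 has no customer, so the LEFT JOIN fills the c columns with NULL and the WHERE clause then discards the row, which is exactly what an INNER JOIN does.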

When it comes to performance, the basic advice for this query is a compound index on all three columns being searched, in the same order as they are used in the query (usually you want the most restrictive condition first, so you might want to adjust this):

ALTER TABLE customer ADD INDEX (Id, Location, Registration_Date);

For more advice on performance, you might want to update your question with the CREATE TABLE statements of your tables and the execution plan of your query.

If my comment and GMB's answer don't end up helping performance much, you can always try writing the query with a different approach. I usually prefer joins to subqueries, but occasionally subqueries turn out to be the best option for the data being handled.

Since you've said the customer table is relatively large compared to the orders table, this could be one of those situations.

SELECT o.* 
FROM customer_orders AS o
WHERE o.Customer_Id IN (
     SELECT Id 
     FROM customer 
     WHERE Location IS NOT NULL 
        AND Registration_Date < '2018-01-01 00:00:00'
);
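A correlated EXISTS is another way to write the same filter. This is only a sketch with the names from the question; which of the two forms MySQL 5.6 executes faster depends on the plan it picks, so it is worth comparing them with EXPLAIN before committing to either:

```sql
-- Keep an order only if its customer passes both conditions.
-- The lookup on c.Id uses the primary key of customer.
SELECT o.*
FROM customer_orders AS o
WHERE EXISTS (
     SELECT 1
     FROM customer AS c
     WHERE c.Id = o.Customer_Id
        AND c.Location IS NOT NULL
        AND c.Registration_Date < '2018-01-01 00:00:00'
);
```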

I was going to leave a comment, but decided to post an answer instead, because the main issue is the question itself.

I don't know how many columns your customer_orders table has, but if you are getting "40 million entries" back, I would say you are doing something wrong. It is probably not the query itself that is slow, but fetching the data.

To check this, run EXPLAIN on your query:

EXPLAIN SELECT ...your query here... ;

Then run:

EXPLAIN SELECT ...your query here... LIMIT 1;

Also try limiting the result to, for example, 1000 rows:

SELECT ...your query here... LIMIT 1000;
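Another way to separate the cost of the join from the cost of sending rows to the client is to count the matches instead of returning them. A rough sketch, written against the tables from the question:

```sql
-- The join and both filters still run in full on the server,
-- but only a single row is sent back to the client.
SELECT COUNT(*)
FROM customer_orders o
INNER JOIN customer c ON c.Id = o.Customer_Id
WHERE c.Location IS NOT NULL
  AND c.Registration_Date < '2018-01-01 00:00:00';
```

If this takes roughly as long as the full query, the time is going into the join itself; if it is much faster, most of the 800 seconds is likely spent transferring the 40 million rows.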

Once you have the output and timings for these queries, we can discuss the next steps.

