MySQL - Very simple Join is taking too long

Question

This is my first question on Stack Overflow, and I am delighted to be part of this community because it has helped me many times.

I'm not an expert in SQL or MySQL, but I'm working on a project that needs large tables (millions of rows). I have a problem with a join and I don't understand why it takes so long. Thanks in advance :)

Here are the tables:

CREATE TABLE IF NOT EXISTS tabla_maestra(
id int UNIQUE,
codigo_alta char(1),
nombre varchar(100),
empresa_apellido1 varchar(150),
apellido2 varchar(50),
tipo_via varchar(20),
nombre_via varchar(100),
numero_via varchar(50),
codigo_via char(5),
codigo_postal char(5),
nombre_poblacion varchar(100),
codigo_ine char(11),
nombre_provincia varchar(50),
telefono varchar(250) UNIQUE,
actividad varchar(100),
estado char(1),
codigo_operadora char(3)
);

CREATE TABLE IF NOT EXISTS tabla_actividades_empresas(
empresa_apellido1 varchar(150),
actividad varchar(100)
);

Here is the query I want to do:

UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
ON (tm.nombre!='' AND tae.empresa_apellido1=tm.empresa_apellido1)
SET tm.actividad=tae.actividad;

This query takes too long, so before executing it I tried to measure how long this simpler query takes:

SELECT COUNT(*) FROM tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae 
ON (tm.nombre!='' AND tae.empresa_apellido1=tm.empresa_apellido1);

It is still taking too long, and I don't understand why. Here are the indexes I use:

CREATE INDEX cruce_nombre
USING HASH
ON tabla_maestra (nombre);

CREATE INDEX cruce_empresa_apellido1
USING HASH
ON tabla_maestra (empresa_apellido1);

CREATE INDEX index_actividades_empresas
USING HASH
ON tabla_actividades_empresas(empresa_apellido1);

If I use the EXPLAIN statement, these are the results:

http://oi59.tinypic.com/2zedoy0.jpg

I would be so grateful to receive any answer that could help me. Thanks a lot, Dani.

Answer 1

A join involving half a million rows, as your query plan shows, is bound to take some time. The COUNT(*) query is quicker because it doesn't need to read the tabla_maestra table itself, but it still needs to scan all the rows of index cruce_empresa_apellido1.

It might help somewhat if you made index_actividades_empresas a unique index (supposing that is indeed appropriate), or if you instead dropped that index and made column empresa_apellido1 the primary key of table tabla_actividades_empresas.
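As a minimal sketch of those two options, the statements could look like the following. It assumes empresa_apellido1 contains no duplicates and no NULLs in tabla_actividades_empresas (the first query checks for duplicates). The index name uq_actividades_empresa is a made-up name for illustration. Options A and B are alternatives, so run one of them, not both.

```sql
-- Check the assumption first: this should return no rows
-- if every enterprise name appears only once.
SELECT empresa_apellido1, COUNT(*) AS copies
FROM tabla_actividades_empresas
GROUP BY empresa_apellido1
HAVING COUNT(*) > 1;

-- Option A: replace the plain index with a unique one.
-- (uq_actividades_empresa is a hypothetical name.)
ALTER TABLE tabla_actividades_empresas
  DROP INDEX index_actividades_empresas,
  ADD UNIQUE INDEX uq_actividades_empresa (empresa_apellido1);

-- Option B: drop the index and make the column the primary key.
-- This fails if the column holds NULLs or duplicates.
ALTER TABLE tabla_actividades_empresas
  DROP INDEX index_actividades_empresas,
  ADD PRIMARY KEY (empresa_apellido1);
```

Either way, the optimizer then knows that each row of tabla_maestra matches at most one row of tabla_actividades_empresas.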

If even that does not give you sufficient performance, then the only other thing I see to do is to give table tabla_actividades_empresas a synthetic primary key of integer type, and to change the corresponding column of tabla_maestra to match. That should help because comparing an integer to an integer is faster than comparing a string to a string, even when you can filter out (most) mismatches via a hash.
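A rough sketch of the synthetic-key idea, under stated assumptions: the column name actividad_id and the index name idx_maestra_actividad_id are hypothetical and do not exist in the original schema, and tabla_actividades_empresas is assumed not to have a primary key yet. Note that filling the new column in tabla_maestra still requires one join on the string column.

```sql
-- Give the lookup table an integer primary key (hypothetical column name).
ALTER TABLE tabla_actividades_empresas
  ADD COLUMN actividad_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (actividad_id);

-- Add the matching integer column to the big table and index it.
ALTER TABLE tabla_maestra
  ADD COLUMN actividad_id INT UNSIGNED NULL,
  ADD INDEX idx_maestra_actividad_id (actividad_id);

-- One-time backfill: this step still joins on the string column once.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad_id = tae.actividad_id;

-- Later joins can then compare integers instead of strings.
SELECT COUNT(*)
FROM tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.actividad_id = tm.actividad_id
WHERE tm.nombre != '';
```

Because of the backfill step, this only pays off when the two tables will be joined repeatedly afterwards.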

Answer 2

I agree with the others (see John Bollinger, for example) about the lack of primary keys. A primary key is highly advisable for IDs (I noticed you worry about repeated values, but a primary key handles that smoothly too; I mean MySQL's AUTO_INCREMENT).

Why do you join on tabla_actividades_empresas.empresa_apellido1 instead of referencing the ID of tabla_maestra?

If you did, you could define a foreign key to it, for example tabla_actividades_empresas.maestra_id.

Joins perform better when the tables are related through non-string types.

You can also filter a table in a subquery before joining it. Here is an example:

UPDATE (SELECT * FROM tabla_maestra WHERE nombre != '') AS tm
INNER JOIN tabla_actividades_empresas AS tae
ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad = tae.actividad;

I have not tested it, but it seems like a reasonable approach to try.

Also, do you need to update every row each time? If not, you can update only the rows that are still missing a value. You can apply the UPDATE with the INNER JOIN after a LEFT JOIN that determines which rows need updating. Does that make sense? I'm no expert, but it may be useful to think about.
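One way to sketch the "update only the rows that need it" idea is to add a filter to the join itself. This is untested and assumes that rows which have not been filled yet hold NULL or an empty string in tabla_maestra.actividad; adjust the condition if your data marks them differently.

```sql
-- Variant 1: touch only rows whose actividad has not been filled yet.
-- Assumption: "not filled" means NULL or ''.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad = tae.actividad
WHERE tm.nombre != ''
  AND (tm.actividad IS NULL OR tm.actividad = '');

-- Variant 2: touch only rows whose value would actually change.
-- <=> is MySQL's NULL-safe equality operator.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad = tae.actividad
WHERE tm.nombre != ''
  AND NOT (tm.actividad <=> tae.actividad);
```

Neither variant avoids the join, so the gain is in writing fewer rows on repeated runs, not in the first run.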

EDIT

You could also try the multiple-table UPDATE syntax:

UPDATE tabla_maestra AS main, tabla_actividades_empresas AS aggr
SET main.actividad = aggr.actividad
WHERE main.empresa_apellido1 = aggr.empresa_apellido1
AND main.nombre <> '';

Don't forget to try adjusting the relationship.

Answer 3

Thank you so much for your answers.

The fact is that 'tabla_maestra' is a table that contains information about enterprises, but it doesn't contain values for the 'actividad' field (the activity of the enterprise). Moreover, the 'id' field is still empty (I will fill it in later. It is difficult to explain why, but it has to be done this way).

I need to add the activity of each enterprise by joining with an auxiliary table, 'tabla_actividades_empresas', which contains the activity for each enterprise name. I only have to do it once. After that I will be able to drop 'tabla_actividades_empresas' because I won't need it.

The only way to join them is by the field 'empresa_apellido1', that is, the name of the enterprise.

I have made the field 'tabla_actividades_empresas.empresa_apellido1' unique, but it doesn't improve the performance.

And it doesn't make sense to define a foreign key on 'tabla_actividades_empresas', because the field 'empresa_apellido1' is unique only in 'tabla_actividades_empresas', not in 'tabla_maestra' (in that table, an enterprise name can appear many times because enterprises can have different offices in different places). That is, 'tabla_actividades_empresas' doesn't contain repeated enterprises, but 'tabla_maestra' has repeated enterprise names.

By the way, what do you mean by "adjusting the relationship"? I have tried your subqueries with the EXPLAIN statement, and they don't use the indexes correctly; the performance is worse.
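For anyone who wants to reproduce the comparison, the checks can be sketched as follows. This assumes the tables and indexes defined in the question. SHOW INDEX is included because, as far as I know, InnoDB and MyISAM tables accept USING HASH but store the index as BTREE, so the Index_type column is worth a look.

```sql
-- Plan for the original join, written with the filter in WHERE.
EXPLAIN
SELECT COUNT(*)
FROM tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
WHERE tm.nombre != '';

-- Plan for the variant that filters in a subquery first.
EXPLAIN
SELECT COUNT(*)
FROM (SELECT empresa_apellido1 FROM tabla_maestra WHERE nombre != '') AS tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1;

-- Check which index type each index really has (see Index_type).
SHOW INDEX FROM tabla_maestra;
SHOW INDEX FROM tabla_actividades_empresas;
```

Comparing the key and rows columns of the two plans shows whether the subquery form still uses cruce_empresa_apellido1.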

3 answers

solution1
1 2014-10-17 14:03:28

solution2
1 2014-10-17 14:40:07

solution3
0 2014-10-20 08:31:31
