
MySQL query for top N entries of grouped records

I'm new to MySQL and to databases in general. I have a query that I pieced together from snippets found online and trial and error. It is really slow (27 seconds), and I assume it can be optimized. Maybe someone could help me out with that.

This is the data structure for my MySQL database (version 5.1.51-0):

|- purchaseID -|- customerID -|- emotionID -|- customerCountryCode -|- customerContinentCode-|
|     1        |     2345     |     0       |        US             |            NA          |
|     2        |     2345     |     3       |        US             |            NA          |
|     3        |     4456     |     0       |        UK             |            EU          |
|     3        |     4456     |     5       |        UK             |            EU          |
|     4        |     4456     |     2       |        UK             |            EU          |
|     5        |     4456     |     2       |        UK             |            EU          |
|     6        |     1234     |     0       |        US             |            NA          |
|     7        |     6678     |     0       |        US             |            NA          |
|     8        |     9900     |     0       |        US             |            NA          |
|     9        |     3334     |     0       |        US             |            NA          |    
|     10       |     3334     |     4       |        US             |            NA          |

The database is used to save all the purchases that are made. For every purchase the customerID, the country, and the continent the customer comes from are saved. The customer also has the option to rate the purchase with one of 6 emotions (happy, disappointed, ...). The emotion they choose is saved as emotionID.

So now I need a query that gets me the top 6 customers for a certain emotionID, with a percentage value. Assuming I looked for emotionID = 0, this is what I would like to get:

|- customerID -|- emotionPercent -|
|     1234     |        100       |     
|     6678     |        100       |     
|     9900     |        100       | 
|     2345     |        50        |     
|     3334     |        50        | 
|     4456     |        25        |    
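For reference, a table definition consistent with the sample data might look like this. Only the customerID (int(11)) and customerCountryCode (varchar(2)) types are stated further down in the question; the remaining column types are assumptions:

```sql
-- Assumed schema; only the customerID and customerCountryCode types
-- are confirmed in the question, the rest is guessed from the sample data.
-- No primary key is declared, since purchaseID repeats in the sample rows.
CREATE TABLE customers (
  purchaseID            INT(11)    NOT NULL,
  customerID            INT(11)    NOT NULL,
  emotionID             INT(11)    NOT NULL,
  customerCountryCode   VARCHAR(2) NOT NULL,
  customerContinentCode VARCHAR(2) NOT NULL
);
```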

I'm using this query:

SELECT customers.customerID, Count( customers.emotionID ) / C.totalPeople * 100.0 AS emotionPercent 
FROM `customers` 
INNER JOIN 

    (SELECT customers.customerID, Count( customers.emotionID ) AS totalPeople
    FROM `customers` 
    GROUP BY customerID) C 

ON customers.customerID = C.customerID 
WHERE customers.emotionID = 0 
GROUP BY customers.customerID 
ORDER BY emotionPercent DESC 
LIMIT 0,6

I have searched for answers, but the additional percentage calculation is throwing me off. I have found some solutions that would require populating some sort of temporary table, but I couldn't get them to work.
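For context, the kind of temporary-table approach I mean looks roughly like this (an untested sketch, using the same table and column names as above):

```sql
-- Sketch: materialize the per-customer totals once, then join against them.
CREATE TEMPORARY TABLE customer_totals AS
  SELECT customerID, COUNT(*) AS totalPeople
  FROM customers
  GROUP BY customerID;

SELECT c.customerID,
       COUNT(*) / t.totalPeople * 100.0 AS emotionPercent
FROM customers c
INNER JOIN customer_totals t ON c.customerID = t.customerID
WHERE c.emotionID = 0
GROUP BY c.customerID
ORDER BY emotionPercent DESC
LIMIT 6;
```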

Problem is: right now there are 140,000 entries in the database, and this query takes about 27 seconds. Can that be right? Would switching to SQL Server increase the speed significantly?

What I don't get is this: asking for the happiest country in the world is lightning fast (0.4 seconds), even though it is structurally similar to the first query (27 seconds):

SELECT customers.customerCountryCode, Count( customers.emotionID ) / C.totalPeople * 100.0 AS emotionPercent 
FROM `customers` 
INNER JOIN 

    (SELECT customers.customerCountryCode, Count( customers.emotionID ) AS totalPeople
    FROM `customers` 
    GROUP BY customerCountryCode) C 

ON customers.customerCountryCode = C.customerCountryCode 
WHERE customers.emotionID = 0 
GROUP BY customers.customerCountryCode 
ORDER BY emotionPercent DESC 
LIMIT 0,6

When I change the GROUP BY of the inner query in this example to customerID, the query also takes forever. So it's the grouping by customerID that's causing the problem. But why?

The customerCountryCode is defined as varchar(2) and the customerID as int(11). Is this causing the huge difference in query performance? Is there a more appropriate column type? The customerID can have up to 8 digits.

A lot of questions! Thanks for reading and any help!

First off, if you think that the entries in your database will keep ballooning, or if the row count is already high and the server slow as it is, IMHO you would want to preprocess the data and store the summarized results in another table; that way you wouldn't have to repeat the same work over and over again. Also, try using caching plugins for your app: Memcached for PHP or Ehcache on J2EE would be safe bets.

Your problem might be that you are using subqueries. Since subqueries neither use nor set indexes, they fall back on the slowest join method possible (i.e. a full table scan). I am not experienced enough to offer an SQL-only solution, so I would recommend you break the query down into two separate calls:

  1. Get the average emotion for each customer, select the top 6, and save them into a hash or object.
  2. Get those 6 customers via WHERE customerID IN (id1, id2, id3, ...).

Although this probably isn't the prettiest of solutions, you avoid using an index-less subquery (and the very slow full table scan).
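A rough sketch of the two calls (a sketch only: the top-6 selection and the percentage math happen in application code, and the IDs in step 2 are placeholders taken from the sample data):

```sql
-- Step 1: per-customer totals; compute percentages and pick the
-- top 6 candidates in the application.
SELECT customerID, COUNT(*) AS totalPeople
FROM customers
GROUP BY customerID;

-- Step 2: emotion counts for only the candidate customers
-- (IN-list built by the application from step 1).
SELECT customerID, COUNT(*) AS emotionCount
FROM customers
WHERE emotionID = 0
  AND customerID IN (1234, 6678, 9900, 2345, 3334, 4456)
GROUP BY customerID;
```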

Thanks for your help!

The guys from the MySQL forum suggested adding some indexes:

ALTER TABLE customers
  ADD KEY idx_country_emid (customerCountryCode, emotionID),
  ADD KEY idx_emid_custid (emotionID, customerID);

The query time dropped from 27 seconds to 0.1 seconds. ;)
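Presumably the composite indexes help because both the inner GROUP BY and the outer filtered GROUP BY can now be satisfied from an index instead of a full table scan. Running EXPLAIN on the original query should confirm that the new keys are actually being picked up (output will vary):

```sql
-- Inspect the query plan; look for the new keys in the "key" column
-- and for "Using index" in the "Extra" column.
EXPLAIN
SELECT customers.customerID,
       COUNT(customers.emotionID) / C.totalPeople * 100.0 AS emotionPercent
FROM customers
INNER JOIN
    (SELECT customerID, COUNT(*) AS totalPeople
     FROM customers
     GROUP BY customerID) C
ON customers.customerID = C.customerID
WHERE customers.emotionID = 0
GROUP BY customers.customerID
ORDER BY emotionPercent DESC
LIMIT 6;
```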

Also, for the inner query, you can write COUNT(*) instead of COUNT(emotionID), since you only need the row count:

(SELECT customers.customerCountryCode, Count( * ) AS totalPeople
    FROM `customers` 
    GROUP BY customerCountryCode) C 
