MySQL - Query to count rows and percentage of total is too slow

I have a table named 'log' that currently has 50 million rows:

| id     | domainIP        |
| foo    | 158.132.34.5    |
| bob    | 128.12.244.3    |
| bob    | 128.12.244.3    |
| bob    | 19.152.134.4    |
| bob    | 168.152.34.9    |
| alice  | 178.132.64.10   |
| alice  | 188.152.214.200 |
| peter  | 208.162.36.153  |
| peter  | 208.162.36.153  |
| peter  | 208.162.36.153  |
| peter  | 198.168.94.201  |

I have the following query to get the number of times each id was used with each 'domainIP', and what percentage of that id's total each count represents:

SELECT
    `log`.`id`,
    `log`.`domainIP`,
    COUNT(`log`.`domainIP`) AS "Times",
    totalsTable.Totals,
    (COUNT(`log`.`domainIP`)/totalsTable.Totals)*100 AS "Percentage"
FROM `log`
JOIN
    (
    SELECT
        `id`,
        COUNT(`domainIP`) AS Totals
    FROM `log` GROUP BY `id`
    ) AS totalsTable

ON (`log`.`id` = totalsTable.`id`)

GROUP BY `log`.`domainIP` ORDER BY `log`.`id` ASC, "Percentage"  DESC

It returns:

| id     | domainIP        | Times | Totals | Percentage |
| foo    | 158.132.34.5    | 1     | 1      | 100        |
| bob    | 128.12.244.3    | 2     | 4      | 50         |
| bob    | 19.152.134.4    | 1     | 4      | 25         |
| bob    | 168.152.34.9    | 1     | 4      | 25         |
| alice  | 178.132.64.10   | 1     | 2      | 50         |
| alice  | 188.152.214.200 | 1     | 2      | 50         |
| peter  | 208.162.36.153  | 3     | 4      | 75         |
| peter  | 198.168.94.201  | 1     | 4      | 25         |

The result is exactly what I need, but the query is unusably slow (it takes several minutes).

Here's the table structure, exported from phpMyAdmin:

CREATE TABLE `log` (
  `id` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL,
  `eDate` datetime DEFAULT NULL,
  `domainIP` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL,
  `event` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

ALTER TABLE `log`
  ADD UNIQUE KEY `logUnique` (`id`,`eDate`,`event`),
  ADD KEY `eDate` (`eDate`),
  ADD KEY `id` (`id`,`eDate`),
  ADD KEY `event` (`id`,`eDate`,`event`);

Results of EXPLAIN query on a smaller version of the table:

id | select_type | table | type  | possible_keys      | key       | key_len | ref            | rows  | Extra
1 | PRIMARY | <derived2> | ALL   | NULL               | NULL      | NULL    | NULL           | 100   | Using where; Using temporary; Using filesort 
1 | PRIMARY | log        | ref   | logUnique,id,event | logUnique | 453     | totalsTable.id | 1     |  
2 | DERIVED | log        | index | NULL               | id        | 459     | NULL           | 100   |

I need a query that returns the same thing but is usable (returns results in a matter of seconds, not minutes), and I don't know how to write it.

Note: adding an index on domainIP only slightly improves the response time on a small sample; on the full table the query still takes more than 10 minutes to return the result.

The table was created for other purposes, and I'd prefer to modify its structure as little as possible, if at all.

You may find that this is a bit faster. Start with this version:

SELECT l.id, l.domainIP, COUNT(*) as Times,
       (SELECT COUNT(*) FROM log l2 WHERE l2.id = l.id) as Total
FROM log l
GROUP BY l.id, l.domainIP
ORDER BY l.id ASC;

Your existing index starting with id should be sufficient.
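The query above returns the counts and the per-id total but not the percentage column from the question. A minimal sketch of how it could be added, assuming the same `log` table, is below. It simply repeats the correlated subquery, because MySQL does not allow the `Total` alias to be referenced elsewhere in the same select list. I have not measured this on 50 million rows, so treat it as a starting point rather than a guaranteed improvement.

```sql
-- Sketch: the query above, extended with the Percentage column.
-- The subquery is repeated because a select-list alias (Total)
-- cannot be reused inside the same select list.
SELECT l.id,
       l.domainIP,
       COUNT(*) AS Times,
       (SELECT COUNT(*) FROM log l2 WHERE l2.id = l.id) AS Total,
       COUNT(*) * 100
         / (SELECT COUNT(*) FROM log l2 WHERE l2.id = l.id) AS Percentage
FROM log l
GROUP BY l.id, l.domainIP
ORDER BY l.id ASC, Percentage DESC;
```

Note that the alias in ORDER BY is left unquoted here. In MySQL's default SQL mode (without ANSI_QUOTES), a double-quoted "Percentage" is read as a string literal rather than a column reference, so it has no effect on the sort order.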

Actually, you can even remove the correlated subquery to measure the performance of just the GROUP BY. If that alone is not fast enough, then you know that you cannot make your more complicated query fast enough either. You will need to try some other method, such as using triggers to maintain the total counts.
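The two steps in the previous paragraph could look roughly like this. The first statement is the bare GROUP BY to time on its own. The rest is one possible shape for the trigger approach: a summary table that keeps one running count per (id, domainIP) pair. The names `log_counts` and `log_counts_ai` are hypothetical, the sketch assumes rows are only ever inserted into `log` (deletes and updates would need their own triggers), and it skips rows where either column is NULL.

```sql
-- Step 1: time the grouping by itself (no totals, no percentage).
SELECT l.id, l.domainIP, COUNT(*) AS Times
FROM log l
GROUP BY l.id, l.domainIP;

-- Step 2 (only if step 1 is too slow): a summary table.
-- log_counts is a hypothetical name, not part of the original schema.
CREATE TABLE log_counts (
  id       VARCHAR(150) COLLATE utf8_unicode_ci NOT NULL,
  domainIP VARCHAR(150) COLLATE utf8_unicode_ci NOT NULL,
  Times    BIGINT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (id, domainIP)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- One-off backfill from the existing rows (this is still a slow query,
-- but it only has to run once).
INSERT INTO log_counts (id, domainIP, Times)
SELECT id, domainIP, COUNT(*)
FROM log
WHERE id IS NOT NULL AND domainIP IS NOT NULL
GROUP BY id, domainIP;

-- Keep the counts current on every insert into log.
-- DELIMITER is a mysql client command, not server-side SQL.
DELIMITER //
CREATE TRIGGER log_counts_ai AFTER INSERT ON log
FOR EACH ROW
BEGIN
  IF NEW.id IS NOT NULL AND NEW.domainIP IS NOT NULL THEN
    INSERT INTO log_counts (id, domainIP, Times)
    VALUES (NEW.id, NEW.domainIP, 1)
    ON DUPLICATE KEY UPDATE Times = Times + 1;
  END IF;
END//
DELIMITER ;

-- The report then reads only the (much smaller) summary table.
SELECT c.id, c.domainIP, c.Times, t.Totals,
       c.Times * 100 / t.Totals AS Percentage
FROM log_counts c
JOIN (SELECT id, SUM(Times) AS Totals
      FROM log_counts
      GROUP BY id) AS t
  ON t.id = c.id
ORDER BY c.id ASC, Percentage DESC;
```

This leaves the structure of `log` itself untouched, at the cost of one extra write per inserted row.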

At a glance, it's no surprise that the query takes so long: the table uses a non-unique varchar id and a varchar domainIP. Comparing strings can be much slower than comparing integer fields. You should consider normalizing the schema:

  1. the id field should be a unique integer identifier, BIGINT for example;
  2. you should declare a table like user_names consisting of id and user_name . Then you should declare a table like user_ips consisting of id , user_id (which is actually an id from user_names ) and domainIP .

These few changes alone should increase query speed significantly. Hopefully this helps a bit.
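For illustration, the two tables described in the list above might be declared as follows. This is only a sketch under my own assumptions: the column types, the index name `user_ip`, the constraint name `fk_user_ips_user`, and the choice of one `user_ips` row per logged event are not specified in the answer.

```sql
-- Lookup table: one row per distinct user name, with an integer key.
CREATE TABLE user_names (
  id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_name VARCHAR(150) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY user_name (user_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- One row per logged event (assumption), pointing at user_names.id.
CREATE TABLE user_ips (
  id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id  BIGINT UNSIGNED NOT NULL,
  domainIP VARCHAR(150) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (id),
  KEY user_ip (user_id, domainIP),  -- hypothetical index for the GROUP BY
  CONSTRAINT fk_user_ips_user
    FOREIGN KEY (user_id) REFERENCES user_names (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- The counting query then groups on the integer key first.
SELECT n.user_name AS id, i.domainIP, COUNT(*) AS Times
FROM user_ips i
JOIN user_names n ON n.id = i.user_id
GROUP BY i.user_id, n.user_name, i.domainIP
ORDER BY n.user_name ASC;
```

Keep in mind that this is a larger structural change than the question asks for, since the existing `log` data would have to be migrated into the new tables.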
