
MySQL query execution takes time with a single large table?

I have written a scraping script that downloads information from certain websites into a database, which is then used to monitor historical listing information and aggregate counts.

Here is the structure of the table:

    CREATE TABLE IF NOT EXISTS `biz_listing` (
          `id` bigint(11) NOT NULL,
          `lid` bigint(11) NOT NULL,
          `cid` bigint(11) NOT NULL,
          `name` varchar(300) NOT NULL,
          `type` enum('homeservices','restaurants') NOT NULL,
          `location` varchar(300) NOT NULL,
          `businessID` varchar(300) NOT NULL,
          `reviewcount` int(6) NOT NULL,
          `rating` decimal(10,1) NOT NULL,
          `city` varchar(300) NOT NULL,
          `categories` varchar(300) NOT NULL,
          `result_month` varchar(10) NOT NULL,
          `updated_date` date NOT NULL,
          KEY `businessID` (`businessID`),
          KEY `updated_date` (`updated_date`)
        ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

The script has collected about 3.5 million rows so far. Because of the large number of records in the table, queries now take a long time to execute and often time out. We run certain queries to build reports from the collected results. The scraping script is live and still populating the table, but I currently cannot generate the reports that rely on aggregate functions.

For reference, here is the query used for the aggregate reports:

    SELECT
        COUNT(t.`type`) AS count,
        COUNT(t.`businessID`) AS bizcount,
        SUM(t.reviewcount) AS reviewcount,
        t.`type`,
        t.`location` AS city
    FROM `biz_listing` t
    INNER JOIN (
        SELECT `businessID`, COUNT(*) c
        FROM `biz_listing`
        WHERE DATE_FORMAT(`updated_date`, '%m %Y') BETWEEN '01 2014' AND '02 2014'
        GROUP BY `businessID`
        HAVING c = 2
    ) t2 ON t2.`businessID` = t.`businessID`
    WHERE DATE_FORMAT(t.`updated_date`, '%m %Y') = '01 2014'
      AND t.type = 'homeservices'
    GROUP BY t.location, t.result_month

The query above produces a location-wise report of business listing counts and their review counts. Here it reports on businesses that appear in the database in both January 2014 and February 2014.

Queries against the biz_listing table now take a long time, and the process often fails.

EXPLAIN output: [image not included]

Is storing all the data in a single table the reason for this? The current script is set to keep scraping into the same table. I can't afford to lose any data, and I also need to make the report query faster.

On some forums I read that table size is not the issue in cases like this and that proper partitioning would help. Since I'm concerned about the data, I'm wary of experimenting.

Since the table will keep growing, would partitioning it help? I only know about partitioning from the reference documentation, and I'm unsure how to implement it.

Any suggestions or advice would be highly appreciated. I can also provide supporting information if necessary.

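The first thing to do is remove DATE_FORMAT and compare the dates directly. Wrapping updated_date in a function stops MySQL from using the index on that column for the range lookup, and the formatted values are compared as strings rather than as dates:-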

    SELECT 
        COUNT(t.`type`) AS count,
        COUNT(t.`businessID`) AS bizcount, 
        SUM(t.reviewcount) AS reviewcount,
        t.`type`,
        t.`location` AS city 
    FROM `biz_listing` t 
    INNER JOIN 
    ( 
        -- businesses with exactly two rows across January and February
        SELECT `businessID`, COUNT(*) c 
        FROM `biz_listing` 
        WHERE updated_date BETWEEN '2014/01/01' AND '2014/02/28' 
        GROUP BY `businessID` 
        HAVING c = 2 
    ) t2 ON t2.`businessID` = t.`businessID` 
    -- outer filter is January only, matching the original '01 2014' condition
    WHERE t.updated_date BETWEEN '2014/01/01' AND '2014/01/31' 
    AND t.type = 'homeservices' 
    GROUP BY t.location, t.result_month

The downside is that you have to specify the last day of the month yourself. You can overcome that using LAST_DAY:-

    SELECT 
        COUNT(t.`type`) AS count,
        COUNT(t.`businessID`) AS bizcount, 
        SUM(t.reviewcount) AS reviewcount,
        t.`type`,
        t.`location` AS city 
    FROM `biz_listing` t 
    INNER JOIN 
    ( 
        -- businesses with exactly two rows across January and February
        SELECT `businessID`, COUNT(*) c 
        FROM `biz_listing` 
        WHERE updated_date BETWEEN '2014/01/01' AND LAST_DAY('2014/02/01')
        GROUP BY `businessID` 
        HAVING c = 2 
    ) t2 ON t2.`businessID` = t.`businessID` 
    -- outer filter is January only, matching the original '01 2014' condition
    WHERE t.updated_date BETWEEN '2014/01/01' AND LAST_DAY('2014/01/01')
    AND t.type = 'homeservices' 
    GROUP BY t.location, t.result_month

Note that because LAST_DAY is acting on a constant here, it is evaluated once for each place it appears in the query rather than once for each row being checked.
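As a quick check of what LAST_DAY returns for a constant like the ones used above, the first statement below can be run on its own. The second statement is an alternative I am adding as an assumption, not something from the queries above: a half-open range whose exclusive upper bound is the first day of the following month, built with DATE_ADD, which also avoids hard-coding the month length.

```sql
-- LAST_DAY applied to a constant returns the last day of that month.
SELECT LAST_DAY('2014-02-01') AS feb_end;  -- 2014-02-28 (2014 is not a leap year)

-- Alternative sketch (my assumption, not part of the answer's queries):
-- a half-open range, so no end-of-month date has to be written at all.
SELECT COUNT(*) AS rows_in_jan_feb
FROM biz_listing
WHERE updated_date >= '2014-01-01'
  AND updated_date <  DATE_ADD('2014-01-01', INTERVAL 2 MONTH);
```

Either form keeps updated_date bare on the left-hand side of the comparison, which is what lets the index on that column be considered.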

You probably also want to add a composite index on type and updated_date (i.e., one index that contains both columns). Similarly, add a composite index on businessID and updated_date.
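A minimal sketch of those two indexes, assuming the table definition from the question. The index names idx_type_updated and idx_biz_updated are placeholders of my own. Building indexes on a table of this size can take a while and may block writes from the scraper while it runs, so it is worth trying on a copy of the table first.

```sql
-- Hypothetical index names; the columns come from the question's table.
ALTER TABLE biz_listing
    ADD INDEX idx_type_updated (`type`, updated_date),
    ADD INDEX idx_biz_updated (businessID, updated_date);

-- Check whether the optimizer picks up the new indexes for the subquery.
EXPLAIN
SELECT businessID, COUNT(*) AS c
FROM biz_listing
WHERE updated_date BETWEEN '2014-01-01' AND LAST_DAY('2014-02-01')
GROUP BY businessID
HAVING c = 2;
```

Since businessID is declared as varchar(300), the second index is fairly wide. If the real IDs are much shorter, an index prefix such as businessID(50) might be an option, though the prefix length here is only a guess.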

EDIT

Looking at your query again, it appears you are looking for businesses that have a record in one month and also in the next month. If I understand correctly, each business can have only 1 record per month (hence you counted them over 2 months and used HAVING ... = 2).

If this is correct then you can maybe do multiple joins, one for each month:-

    SELECT 
            COUNT(t0.type) AS count,
            COUNT(t0.businessID) AS bizcount, 
            SUM(t0.reviewcount) AS reviewcount,
            t0.type,
            t0.location AS city,
            t0.result_month
    FROM biz_listing t0 
    INNER JOIN biz_listing t1
    ON t0.businessID = t1.businessID
    INNER JOIN biz_listing t2
    ON t0.businessID = t2.businessID
    WHERE t0.updated_date BETWEEN '2014/01/01' AND LAST_DAY('2014/01/01')
    AND t1.updated_date BETWEEN '2014/01/01' AND LAST_DAY('2014/01/01')
    AND t2.updated_date BETWEEN '2014/02/01' AND LAST_DAY('2014/02/01')
    AND t0.type = 'homeservices' 
    GROUP BY t0.location, t0.type, t0.result_month

Note: if I have misunderstood and a businessID can have multiple records in a month, then this will not work.
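One way to check that assumption before relying on the multi-join version is to look for any businessID with more than one row in a given month. This is only a sketch against the question's table, using January 2014 as the example month. An empty result would suggest the one-record-per-month assumption holds for that month.

```sql
-- Lists up to 10 businesses that have more than one row in January 2014.
SELECT businessID, COUNT(*) AS records_in_month
FROM biz_listing
WHERE updated_date BETWEEN '2014-01-01' AND LAST_DAY('2014-01-01')
GROUP BY businessID
HAVING COUNT(*) > 1
LIMIT 10;
```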

Please create an index on the table's updated_date and type columns; this will help the query execute faster.

