
MySQL Query Runs forever

I have a table with over 250 million records. Our reporting server regularly queries that table with queries similar to this one:

SELECT
    COUNT(*),
    DATE(updated_at) AS date,
    COUNT(DISTINCT INT_FIELD) 
FROM
    TABLE_WITH_250_Million 
WHERE
    Field1 = 'value in CHAR' 
    AND field2 = 'VALUE in CHAR' 
    AND updated_at > '2012-04-27' 
    AND updated_at < '2012-04-28 00:00:00' 
GROUP BY
    Field2,
    DATE(updated_at) 
ORDER BY
    date DESC

I have tried creating a BTREE index on the table covering Field1, Field2, Field3 DESC in that order, but it's not giving me the right result.

Can anyone help me figure out how to optimize it? My problem is that I can't change the query, as I don't have access to the code the reporting server runs it from.

Any help would be really appreciated.

Thanks


Here's my table:

CREATE TABLE backup_jobs ( 
  id int(11) unsigned NOT NULL AUTO_INCREMENT, 
  backup_profile_id int(11) DEFAULT NULL, 
  state varchar(32) DEFAULT NULL, 
  `limit` int(11) DEFAULT NULL, 
  file_count int(11) DEFAULT NULL, 
  byte_count bigint(20) DEFAULT NULL, 
  created_at datetime DEFAULT NULL, 
  updated_at datetime DEFAULT NULL, 
  status_type varchar(32) DEFAULT NULL, 
  status_param_1 varchar(255) DEFAULT NULL, 
  status_param_2 varchar(255) DEFAULT NULL, 
  status_param_3 varchar(255) DEFAULT NULL, 
  started_at datetime DEFAULT NULL,
  PRIMARY KEY (id),
  KEY index_backup_jobs_on_state (state),
  KEY index_backup_jobs_on_backup_profile_id (backup_profile_id),
  KEY index_backup_jobs_created_at (created_at),
  KEY idx_backup_jobs_state_updated_at (state,updated_at) USING BTREE,
  KEY idx_backup_jobs_state_status_param_1_updated_at (state,status_param_1,updated_at) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=508748682 DEFAULT CHARSET=utf8;

I'm sure that not all 250M rows fall in the date range of interest.

The problem is that the range ("between") nature of the date check forces a table scan, because you can't know where the dates fall.

I'd recommend that you partition the 250M-row table by week, month, quarter, or year; then a query over a given date range only has to scan the partitions that fall within it. That'll help matters.

If you go down the partition road, you'll need to talk to a MySQL DBA, preferably someone who's familiar with partitioning. It's not for the faint of heart.

http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
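
For illustration, here's a minimal sketch of what monthly range partitioning on updated_at could look like for the posted backup_jobs table. It's an assumption-laden sketch, not a drop-in command: MySQL requires every unique key (including the primary key) to contain all columns used in the partitioning expression, so the primary key is widened first, and the partition names and boundaries are invented for the example.

-- Sketch only: widen the primary key to include the partitioning column,
-- as MySQL requires for partitioned tables.
ALTER TABLE backup_jobs
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, updated_at);

-- Hypothetical monthly range partitions; a WHERE range on updated_at then
-- only scans the matching partitions (partition pruning).
ALTER TABLE backup_jobs
  PARTITION BY RANGE (TO_DAYS(updated_at)) (
    PARTITION p201204 VALUES LESS THAN (TO_DAYS('2012-05-01')),
    PARTITION p201205 VALUES LESS THAN (TO_DAYS('2012-06-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );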

Add INT_FIELD to the index:

CREATE INDEX idx_backup_jobs_state_status_param_1_updated_at_profile_id ON backup_jobs (state, status_param_1, updated_at, backup_profile_id)

to make it cover all fields.

This way, table lookups go away (you will see Using index in the plan), which will make your query some 10x faster (your mileage may vary).
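
If it helps, a quick way to verify is to run the query under EXPLAIN. The version below is a guess at how the placeholders map onto the posted schema (Field1 = state, Field2 = status_param_1, INT_FIELD = backup_profile_id), with made-up filter values:

-- Hypothetical column mapping and values; look for "Using index" in the
-- Extra column to confirm the index covers the query.
EXPLAIN
SELECT
    COUNT(*),
    DATE(updated_at) AS date,
    COUNT(DISTINCT backup_profile_id)
FROM backup_jobs
WHERE state = 'some_state'
  AND status_param_1 = 'some_value'
  AND updated_at > '2012-04-27'
  AND updated_at < '2012-04-28 00:00:00'
GROUP BY status_param_1, DATE(updated_at)
ORDER BY date DESC;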

Also note that (at least for the single-date range provided) GROUP BY DATE(updated_at) and ORDER BY date DESC are redundant and will only make the query use a temporary table and a filesort to no real purpose. Not that you can do much about it, though, if you cannot change the query.

Per your query, you'll have to take the lead here and decide on the smallest granularity. We have no idea what the frequency of activity is, what the Field1 and Field2 status entries are, how far back your data goes, or how many entries would be normal on a given single date. All that said, I would build my indexes with the smallest-granularity column first, matching your query criteria as closely as possible.

Ex: if your "Field1" has a dozen possible "CHAR" values, you are applying an "IN" clause, and Field1 is first in your index, the query will have to hit each of those values for each date and Field2 value. 250 million records could force a lot of index paging activity, especially going back through history. Likewise with your Field2. However, because of your GROUP BY clause on Field2 and the updated date, I would have one of those in the first or second position of the index. Based on the historical data, I would even tend to go with the following index, with the date as the primary basis and the other criteria within it.

index ( Updated_At, Field2, Field1, INT_FIELD )

This way, your entire query can be satisfied from the index alone and never needs to touch the raw data of the actual records; all the fields it needs are right there in the index. You have a finite date range, so updated_at is qualified right away and is in order to prepare for the GROUP BY. From there, the "CHAR" values from Field2 nicely finish your GROUP BY, Field1 qualifies your third criterion (the "IN" list of CHAR values), and finally INT_FIELD serves the COUNT(DISTINCT).

I don't know how long the index will take to build on 250 million rows, but that is where I would start.
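
For reference, applying that ordering to the posted backup_jobs schema would look something like the statement below. The column mapping is a guess (Field1 = state, Field2 = status_param_1, INT_FIELD = backup_profile_id) and the index name is made up, so adjust both to the real columns the reporting query uses.

-- Hypothetical mapping of the placeholder columns onto backup_jobs;
-- the leading updated_at lets the date range narrow the scan, and the
-- remaining columns make the index covering for this query.
CREATE INDEX idx_backup_jobs_updated_at_covering
    ON backup_jobs (updated_at, status_param_1, state, backup_profile_id);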
