I need help in indexing in MySQL. I have a table in MySQL with following rows:
ID Store_ID Feature_ID Order_ID Viewed_Date Deal_ID IsTrial
The ID is auto generated. Store_ID goes from 1 - 8. Feature_ID from 1 - let's say 100. Viewed Date is Date and time on which the data is inserted. IsTrial is either 0 or 1.
You can ignore Order_ID and Deal_ID from this discussion.
There are millions of data in the table and we have a reporting backend that needs to view the number of views in a certain period or overall where trial is 0 for a particular store id and for a particular feature.
The query takes the form of:
select count(viewed_date)
from theTable
where viewed_date between '2009-12-01' and '2010-12-31'
and store_id = '2'
and feature_id = '12'
and Istrial = 0
In SQL Server you can have a filtered index to use for Istrial. Is there anything similar to this in MySQL? Also, Store_ID and Feature_ID have a lot of duplicate data. I created an index using Store_ID and Feature_ID. Although this seems to have decreased the search period, I need better improvement than this. Right now I have more than 4 million rows. To search for a particular query like the one above, it looks at 3.5 million rows in order to give me the count of 500k rows.
PS. I forgot to add view_date filter in the query. Now I have done this.
If you need to optimize this query specifically in MySQL, why not add istrial to the end of the existing index on Store_ID and Feature_ID. This will completely index away the WHERE clause and will be able to grab the COUNT from the cardinality summary of the index if the table is MyISAM. All of your existing queries that leverage the current index will be unchanged as well.
edit: also, I'm unsure of why you're doing COUNT(viewed_date)
instead of COUNT(*)
? Is viewed_date
ever NULL? If not, you can just use the COUNT(*)
which will eliminate the need to go to the .MYD file if you take it in conjunction with my other suggestion.
Well you could expand your index to consist of Store_ID, Feature_ID and IsTrial. You won't get any better than this, performancewise.
My first idea would be an index on (feature_id, store_id, istrial), since feature_id seems to be the column with the highest Shannon entropy. But without knowing the statistics on feature_id i'm not sure. Maybe you should better create two indexes, (store_id, feature_id, istrial) being the other and let the optimizer sort it out. Using all three columns also has the advantage of the database being able to answer your query from the index alone, which should improve performance, too.
But if neither of your columns is selective enough to sufficiently improve index performance, you might have to resort to denormalization by using INSERT/UPDATE triggers to fill a second table (feature_id, store_id, istrial, view_count). This would slow down inserts and updates, of course...
You might want to think about splitting that table horizontally. You could run a nightly job that puts each store_id in a separate table. Or take a look at feature_id, yeah, it's a lot of tables but if you don't need real-time data. It's the route I would take.
The best way I found in tackling this problem is to skip DTA's recommendation and do it on my own in the following way:
I am pretty sure others can suggest good ideas. Doing these gave me good results. I hope someone can use this help. I think DTA does not really make things faster in terms of indexing because you really need to go through what all indexes it is going to create. This is more true for a database that gets hit a lot.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.