
Indexing table with duplicates MySQL/SQL Server with millions of records

I need help with indexing in MySQL. I have a table in MySQL with the following columns:

ID Store_ID Feature_ID Order_ID Viewed_Date Deal_ID IsTrial
The ID is auto-generated. Store_ID ranges from 1 to 8 and Feature_ID from 1 to, say, 100. Viewed_Date is the date and time at which the row was inserted. IsTrial is either 0 or 1.
You can ignore Order_ID and Deal_ID from this discussion.

There are millions of rows in the table, and we have a reporting backend that needs to show the number of views, either in a certain period or overall, where IsTrial is 0, for a particular store ID and a particular feature.

The query takes the form of:

select count(viewed_date) 
from theTable
where viewed_date between '2009-12-01' and '2010-12-31'
and store_id = '2' 
and feature_id = '12'
and Istrial = 0

In SQL Server you can use a filtered index for IsTrial. Is there anything similar in MySQL? Also, Store_ID and Feature_ID contain a lot of duplicate values. I created an index on (Store_ID, Feature_ID). Although this reduced the query time, I need a bigger improvement. Right now I have more than 4 million rows, and a query like the one above examines 3.5 million rows to return a count of 500k.

PS. I forgot to add the viewed_date filter to the query. I have added it now.

If you need to optimize this query specifically in MySQL, why not add IsTrial to the end of the existing index on Store_ID and Feature_ID? The index then covers every equality predicate in the WHERE clause, so MySQL can resolve those filters entirely within the index. Existing queries that use the current index are unaffected, because its leading columns stay the same.

edit: Also, I'm not sure why you're using COUNT(viewed_date) instead of COUNT(*). Is viewed_date ever NULL? If not, just use COUNT(*); combined with my other suggestion, that removes the need to go to the .MYD file (as long as every column the query filters on is in the index).
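As a rough sketch of the suggestion above: the index name below is made up, and appending Viewed_Date as a fourth column is an assumption that goes beyond the answer, added only because the question now filters on a date range.

```sql
-- Hypothetical index name. Equality columns come first; the range column
-- (Viewed_Date) goes last so the date filter can also use the index.
CREATE INDEX idx_store_feature_trial_date
    ON theTable (Store_ID, Feature_ID, IsTrial, Viewed_Date);

-- COUNT(*) instead of COUNT(viewed_date): no NULL check on the column is
-- needed, and every column referenced here is part of the index.
SELECT COUNT(*)
FROM theTable
WHERE Store_ID = 2
  AND Feature_ID = 12
  AND IsTrial = 0
  AND Viewed_Date BETWEEN '2009-12-01' AND '2010-12-31';
```

Running the SELECT under EXPLAIN should show "Using index" in the Extra column if MySQL answers it from the index alone.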

Well, you could expand your index to consist of Store_ID, Feature_ID and IsTrial. You won't get any better than this, performance-wise.

My first idea would be an index on (feature_id, store_id, istrial), since feature_id seems to be the column with the highest Shannon entropy. But without knowing the statistics on feature_id, I'm not sure. It may be better to create two indexes, with (store_id, feature_id, istrial) as the second, and let the optimizer sort it out. Using all three columns also has the advantage that the database can answer your query from the index alone, which should improve performance too.
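One way to try this out, sketched with made-up index names: check how many distinct values each column actually has, create both column orders, and ask EXPLAIN which one the optimizer picks.

```sql
-- Rough selectivity check: more distinct values means a more selective column.
SELECT COUNT(DISTINCT Feature_ID) AS distinct_features,
       COUNT(DISTINCT Store_ID)   AS distinct_stores,
       COUNT(*)                   AS total_rows
FROM theTable;

-- Both column orders; the index names are hypothetical.
CREATE INDEX idx_feature_store_trial ON theTable (Feature_ID, Store_ID, IsTrial);
CREATE INDEX idx_store_feature_trial ON theTable (Store_ID, Feature_ID, IsTrial);

-- The "key" column of the EXPLAIN output names the index MySQL chose.
EXPLAIN
SELECT COUNT(*)
FROM theTable
WHERE Store_ID = 2 AND Feature_ID = 12 AND IsTrial = 0;
```

Once one index turns out never to be chosen, it can be dropped again, since every extra index adds cost to inserts.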

But if none of your columns is selective enough to sufficiently improve index performance, you might have to resort to denormalization, using INSERT/UPDATE triggers to fill a second table (feature_id, store_id, istrial, view_count). This would slow down inserts and updates, of course...
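A minimal sketch of such a summary table, showing only the INSERT trigger. The table name, trigger name and column types are assumptions, and the view_day column is an addition to the layout described above so that the report can still filter by period.

```sql
-- Hypothetical summary table: one row per (store, feature, trial flag, day).
CREATE TABLE view_counts (
    store_id   TINYINT UNSIGNED  NOT NULL,
    feature_id SMALLINT UNSIGNED NOT NULL,
    istrial    TINYINT(1)        NOT NULL,
    view_day   DATE              NOT NULL,  -- assumption: daily granularity
    view_count INT UNSIGNED      NOT NULL DEFAULT 0,
    PRIMARY KEY (store_id, feature_id, istrial, view_day)
);

-- Bump the matching counter on every insert into the detail table.
CREATE TRIGGER theTable_count_views
AFTER INSERT ON theTable
FOR EACH ROW
    INSERT INTO view_counts (store_id, feature_id, istrial, view_day, view_count)
    VALUES (NEW.Store_ID, NEW.Feature_ID, NEW.IsTrial, DATE(NEW.Viewed_Date), 1)
    ON DUPLICATE KEY UPDATE view_count = view_count + 1;

-- The report then sums at most a few hundred small rows.
SELECT SUM(view_count)
FROM view_counts
WHERE store_id = 2
  AND feature_id = 12
  AND istrial = 0
  AND view_day BETWEEN '2009-12-01' AND '2010-12-31';
```

Rows that already exist in theTable would need a one-off backfill, and UPDATE or DELETE triggers would be needed if those rows can change after insertion.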

You might want to think about splitting that table horizontally. You could run a nightly job that puts each store_id in a separate table. Or split on feature_id instead; yes, that's a lot of tables, but if you don't need real-time data, it's the route I would take.
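What the nightly job could run for one store, as a sketch only: the per-store table name is hypothetical, and the incremental copy assumes that ID only ever increases and that rows are not modified after insertion.

```sql
-- One-time setup: an empty copy of the table structure, indexes included.
CREATE TABLE IF NOT EXISTS theTable_store_2 LIKE theTable;

-- Nightly: copy only the rows that arrived since the previous run.
SET @last_id = (SELECT COALESCE(MAX(ID), 0) FROM theTable_store_2);

INSERT INTO theTable_store_2
SELECT *
FROM theTable
WHERE Store_ID = 2
  AND ID > @last_id;
```

The reporting query for store 2 would then run against theTable_store_2 and no longer needs the Store_ID filter.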

The best way I have found to tackle this problem is to skip the recommendations of DTA (Database Engine Tuning Advisor) and do it myself in the following way:

  • Use Profiler to find the costliest queries in terms of CPU usage (probably blocking queries) and apply indexes to tables based on those queries. If the query execution plan can be changed to decrease the reads, writes and overall execution time, do that first. If not, and the query is what it is, apply the clustered/non-clustered index combination that suits it best. This depends on the nature of the existing table indexes, the total bytes of the columns participating in the index, etc.
  • Run queries in SSMS to find the most frequently executed queries and do the same as above.
  • Create a defragmentation schedule to either Reorganize or Rebuild indexes, depending on how fragmented they are.
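For the defragmentation step, a sketch in T-SQL of how fragmentation can be inspected in SQL Server. The index and table names in the ALTER INDEX statements are hypothetical, and the 5% and 30% thresholds are a commonly cited rule of thumb rather than something stated in this answer.

```sql
-- List indexes in the current database with noticeable fragmentation.
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 5
  AND i.name IS NOT NULL                 -- skip heaps, which have no index name
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Assumed rule of thumb: reorganize at roughly 5-30% fragmentation...
ALTER INDEX IX_theTable_Store_Feature ON dbo.theTable REORGANIZE;

-- ...and rebuild above roughly 30%.
ALTER INDEX IX_theTable_Store_Feature ON dbo.theTable REBUILD;
```

Very small indexes (low page_count) usually are not worth defragmenting, which is why that column is included in the listing.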

I am pretty sure others can suggest good ideas too. Doing these things gave me good results, and I hope someone finds this useful. I don't think DTA really makes things faster in terms of indexing, because you still need to review every index it proposes to create. This is especially true for a database that gets hit a lot.


 