
MySQL - Large metrics table and heavy query performance - Caching?

I've got a large database that is expanding quite rapidly, with a number of busy tables logging every aspect of users' behaviour.

At the moment I have a studio where users can see this usage and behaviour displayed in charts and so on. The thing is, it's now seriously intensive to load this stuff. One project with 80,000 users takes an age to load its stats.

Now, the tables are quite well structured and indexed on the join columns, and I've sought advice and best practice along the way to help prepare for this data size. But without much more scope for query/table optimisation, how else can I speed up this intensive process?

I notice most analytics tools only let you view data up until yesterday by default. Does that help?

  1. Does this mean the statistics can be served from MySQL's query_cache? If the query's date range always runs up to today (thereby counting today's stats), will it never be cached?
  2. Is it more sensible to compile static XML files (or other precomputed results) each hour that can be referenced, instead of running the queries each time? (A rough sketch of that idea follows this list.)
  3. What else could help?
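Regarding item 2, here is one way the hourly precomputation could look if kept inside MySQL itself, using the event scheduler. The hourly_stats table and event name are made up for illustration, and the raw log is assumed to be a table roughly like measures(user_id, timestamp, action):

-- Hypothetical summary table; column types are illustrative.
create table hourly_stats(
   stat_hour datetime
  ,action    varchar(50)
  ,cnt       int
  ,primary key(stat_hour, action)
);

-- Requires the event scheduler to be on (SET GLOBAL event_scheduler = ON).
-- Each hour, recompute the previous complete hour's counts.
create event refresh_hourly_stats
on schedule every 1 hour
do
  replace into hourly_stats(stat_hour, action, cnt)
  select date_format(timestamp, '%Y-%m-%d %H:00:00')
        ,action
        ,count(*)
    from measures
   where timestamp >= date_format(now() - interval 1 hour, '%Y-%m-%d %H:00:00')
     and timestamp <  date_format(now(), '%Y-%m-%d %H:00:00')
   group
      by date_format(timestamp, '%Y-%m-%d %H:00:00')
        ,action;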

Any thoughts very much welcome.

You'd want to split things up into two databases: one optimized for insertion, to capture the data, and a second one optimized for data retrieval. You can't do this with one single database handling both tasks. Optimizing for heavy data insertion means reducing indexing to the absolute bare minimum (basically just primary keys), and removing keys kills performance when it comes time to do the data mining.

So... two databases. Capture all the data into the insert-optimized one. Then have a scheduled job slurp over the day's data capture into the other database, and run your analyses there.

As a side effect, this is where the "up until yesterday" restriction comes from. Today's data won't be available, as it's in a separate database.
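A minimal sketch of that nightly copy, assuming two schemas on the same server called capture and reporting (names invented here) and a simple three-column log table:

-- Nightly job (cron or a MySQL event): copy yesterday's rows from the
-- insert-optimized schema into the reporting schema.
insert into reporting.measures(user_id, timestamp, action)
select user_id, timestamp, action
  from capture.measures
 where timestamp >= curdate() - interval 1 day
   and timestamp <  curdate();

-- Optionally prune the capture table once the copy has been verified:
-- delete from capture.measures where timestamp < curdate();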

If you don't need to show real-time results, you can cache the results in Memcache, APC, Redis or an equivalent, with the cache set to expire after one day.

MySQL will cache results in the query_cache, but remember that MySQL clears the query cache whenever the underlying table changes, and the cache has a limited size.
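If you do lean on the query cache, it is worth checking how it is configured and how often entries actually survive (and note that the query cache was removed entirely in MySQL 8.0):

show variables like 'query_cache%';  -- is it enabled, and how big is it?
show status like 'Qcache%';          -- hits vs. inserts vs. low-memory prunes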

Is extra hardware out of the question? Replicating the data to a few slaves would probably speed things up in this situation. You could also use a version of Marc B's suggestion for splitting the database by only updating the slaves at off-peak times, overnight for example.
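For reference, pointing a read slave at the master looks roughly like this (MySQL 5.x syntax; the host, credentials and binlog coordinates below are placeholders):

change master to
  master_host     = 'db-master.example.com'
 ,master_user     = 'repl'
 ,master_password = '********'
 ,master_log_file = 'mysql-bin.000001'
 ,master_log_pos  = 4;
start slave;

-- Send the statistics/reporting queries to the slave; inserts keep going to the master.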

Marc B is right - you want to separate your data capture from your analytics/reporting system.

The conventional name for this is a "data warehouse", or similar. These tend to have very different schemas from your production database - highly denormalized, multi-dimensional "star" schemas are common.
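As a rough illustration of what a "star" schema means here - a central fact table surrounded by small dimension tables; the table and column names below are invented, not a prescription:

create table dim_date(
   date_id   int primary key   -- e.g. 20120131
  ,full_date date
  ,month_nr  int
  ,year_nr   int
);

create table dim_action(
   action_id int primary key
  ,name      varchar(50)
  ,category  varchar(50)
);

create table fact_user_action(
   date_id     int
  ,action_id   int
  ,user_id     int
  ,occurrences int
  ,primary key(date_id, action_id, user_id)
  ,foreign key(date_id)   references dim_date(date_id)
  ,foreign key(action_id) references dim_action(action_id)
);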

If you see your product growing continuously, you may want to make the jump right now - but it's a whole new skill and technology set, so you might want to take baby steps.

In either case, run your data collection and reporting databases on physically separate hardware. If you do go the data warehouse route, budget for lots of disk space.

You don't say exactly how big the tables are, what kind of tables they are, how they are being populated or how they are being used, so I'm just going to offer some random thoughts :)

When you are reporting over large amounts of data, you are basically limited to the speed of your disk system, i.e. the rate at which your disks can deliver data to MySQL. This rate is usually measured in megabytes per second. So if you can get 100 MB/s, you cannot perform a select sum() or count(*) on a table bigger than 100 MB and still get sub-second response time (completely ignoring the DB cache for a moment). Please note that 100 MB is something like 2 million records with a row size of 50 bytes.
Relying on the cache works up to a point and then everything just dies - usually when the database becomes larger than available memory and the number of concurrent users increases.

You will want to investigate the possibility of creating aggregate tables, so that you can reduce the number of megabytes you need to scan through. This is best explained by an example. Say your current measure table looks something like this:

create table measures(
   user_id   int            -- column types are just illustrative
  ,timestamp datetime
  ,action    varchar(50)
);

For every single action performed (logged in, logged out, clicked this, farted, clicked that) you store the ID of the user and the timestamp when it happened.

If you want to plot the daily number of logins from the start of the year, you would have to perform a count(*) over all 100,000,000 rows and group by date(timestamp).

Instead, you could provide a precalculated table such as:

create table daily_actions(
   day     date
  ,action  varchar(50)
  ,occured int
  ,primary key(day, action)
);

That table would typically be loaded with something like:

select date(timestamp)
      ,action
      ,count(*)
  from measures
 group
    by date(timestamp)
      ,action
If you had 100 possible actions, you would need only 36,500 rows to store the activities of an entire year. Users running statistics, charts, reports and whatnot against that data wouldn't be any heavier than your typical OLTP transactions. Of course, you could store it on an hourly basis as well (or instead) and arrive at 876,000 rows for a year. You can also report on weekly, monthly, tertial or yearly figures using the above table. If you can group your user actions into categories, say "Fun", "Not so fun", "Potentially harmful" and "Flat out wrong", you could reduce the storage further, from 100 possible actions down to 4.
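For example, monthly figures come straight out of the aggregate table:

select date_format(day, '%Y-%m') as month
      ,action
      ,sum(occured) as occured
  from daily_actions
 group
    by date_format(day, '%Y-%m')
      ,action;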

Obviously, your data is more complicated than this, but you can almost always come up with a suitable number of aggregate tables that can answer almost any question at a high aggregation level. Once you have "drilled down" through the aggregate tables, you have accumulated all those filters, and you may well find it perfectly feasible to select against the lowest-level detail table using a specific date and a specific action.
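For instance, once the aggregates have pointed you at one day and one action (the values below are made up), the detail query only touches a thin slice of measures, especially with an index on (action, timestamp):

select user_id
      ,timestamp
      ,action
  from measures
 where action = 'logged in'
   and timestamp >= '2012-01-15'
   and timestamp <  '2012-01-16';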
