
What's the best way to manage a large number of tables in MS SQL Server?

This question is related to another:
Will having multiple filegroups help speed up my database?

The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.

Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this provides us enormous performance improvements over referring back to the source data.

The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places an enormous theoretical limit on the number of objects in sysobjects (~2 billion), but once our database grows beyond 100,000 or so tables, simple operations like CREATE TABLE and DROP TABLE can slow down dramatically.

We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?

Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, or disk speed). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.

Without seeing the entire system first, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key -- a dimensional model may also be relevant here. This table can be partitioned for better performance, which will also allow you to spread the table across other filegroups.
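A rough sketch of what that combined, partitioned table could look like is below. All object names (AnalysisResult, pf_RunId, the FG_Runs* filegroups) are illustrative assumptions, and the boundary values would depend on how RunIDs are assigned:

    -- Assumes the FG_Runs1..FG_Runs4 filegroups already exist in the database.
    CREATE PARTITION FUNCTION pf_RunId (int)
        AS RANGE RIGHT FOR VALUES (1000, 2000, 3000);   -- one range per block of runs

    CREATE PARTITION SCHEME ps_RunId
        AS PARTITION pf_RunId TO (FG_Runs1, FG_Runs2, FG_Runs3, FG_Runs4);

    CREATE TABLE dbo.AnalysisResult (
        RunId       int            NOT NULL,   -- which saved analysis this row belongs to
        EntityId    int            NOT NULL,
        MetricValue decimal(18,4)      NULL,
        CONSTRAINT PK_AnalysisResult PRIMARY KEY (RunId, EntityId)
    ) ON ps_RunId (RunId);                      -- spread rows across filegroups by RunId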

Another possibility is to put each run in its own database and then detach them, only attaching them as needed (and in read-only form).
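A minimal sketch of that detach/attach cycle, assuming illustrative database names and file paths:

    -- Populate the run's tables in their own database, then detach it.
    CREATE DATABASE Run12345;
    -- ... load the run-specific tables into Run12345 ...
    EXEC sp_detach_db @dbname = N'Run12345';

    -- Later, when that saved analysis is requested, attach it read-only.
    CREATE DATABASE Run12345
        ON (FILENAME = N'D:\Runs\Run12345.mdf'),
           (FILENAME = N'D:\Runs\Run12345_log.ldf')
        FOR ATTACH;
    ALTER DATABASE Run12345 SET READ_ONLY;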

CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.

I also recommend talking to Microsoft about your choice of database design.

Are the tables all different structures? If they are the same structure you might get away with a single partitioned table.

If they are different structures, but just subsets of the same set of dimension columns, you could still store them in partitions in the same table with nulls in the non-applicable columns.
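For example, a single wide table might carry the union of all dimension columns, with each run populating only the columns that apply to it (names here are purely illustrative):

    CREATE TABLE dbo.RunResult (
        RunId   int            NOT NULL,
        RowId   int            NOT NULL,
        DimA    int                NULL,   -- only populated by run type 1
        DimB    int                NULL,   -- only populated by run type 2
        DimC    varchar(20)        NULL,   -- shared by both run types
        Measure decimal(18,4)      NULL,
        CONSTRAINT PK_RunResult PRIMARY KEY (RunId, RowId)
    );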

If this is analytical work (derivative pricing computations, perhaps?), you could dump the results of a computation run to flat files and reuse the computations later by loading from those flat files.
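A minimal sketch of that round trip, assuming bcp native format and illustrative paths and table names:

    -- Export a finished run from a command prompt (native format, trusted connection):
    --   bcp MyDb.dbo.AnalysisResult out D:\Runs\run12345.dat -S MyServer -T -n

    -- Reload it later only if the saved result is actually requested:
    BULK INSERT dbo.AnalysisResult
    FROM 'D:\Runs\run12345.dat'
    WITH (DATAFILETYPE = 'native', TABLOCK);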

This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)

You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that are not evident in your post. For example, how long do you plan to keep the run analysis tables? There are a LOT of other questions that need to be asked.

You are going to need a combination of serious data warehousing and data/table partitioning. Depending on how much data you want to keep and archive, you may need to start de-normalizing and flattening the tables.

This would be a pretty good case where contacting Microsoft directly can be mutually beneficial: Microsoft gets a good case to show other customers, and you get help directly from the vendor.

We ended up splitting our database into multiple databases. So the main database contains a "databases" table that refers to one or more "run" databases, each of which contains distinct sets of analysis results. Then the main "run" table contains a database ID, and the code that retrieves a saved result includes the relevant database prefix on all queries.
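A minimal sketch of how that main-database catalog might look (table and column names here are illustrative only, not our exact schema):

    CREATE TABLE dbo.Databases (
        DatabaseId   int     NOT NULL PRIMARY KEY,
        DatabaseName sysname NOT NULL             -- e.g. 'RunDb_07'
    );

    CREATE TABLE dbo.Run (
        RunId      int NOT NULL PRIMARY KEY,
        DatabaseId int NOT NULL REFERENCES dbo.Databases (DatabaseId)
    );

    -- Retrieval code looks up the run's database and prefixes every query with it.
    DECLARE @db sysname, @sql nvarchar(max);
    SELECT @db = d.DatabaseName
    FROM dbo.Run AS r
    JOIN dbo.Databases AS d ON d.DatabaseId = r.DatabaseId
    WHERE r.RunId = 12345;

    SET @sql = N'SELECT * FROM ' + QUOTENAME(@db) + N'.dbo.Results_12345;';
    EXEC sp_executesql @sql;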

This approach keeps the system catalog of each database at a more reasonable size, provides better separation between the core/permanent tables and the dynamic/run tables, and makes backups and archiving more manageable. It also allows us to split our data across multiple physical disks, although using multiple filegroups would have done that too. Overall, it's working well for us now given our current requirements, and based on expected growth we think it will scale well for us too.

We've also noticed that SQL 2008 tends to handle large system catalogs better than SQL 2000 and SQL 2005 did. (We hadn't upgraded to 2008 when I posted this question.)
