
What's the best way to manage a large number of tables in MS SQL Server?

Original post: 2008-09-23 22:10:05 9 4 sql-server/ performance/ scalability

This question is related to another:
Will having multiple filegroups help speed up my database?

The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.

Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this provides us enormous performance improvements over referring back to the source data.

The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places a pretty enormous theoretical limit on the size of sysobjects (~2 billion), but once our database grows beyond 100,000 or so tables, simple statements like CREATE TABLE and DROP TABLE can slow down dramatically.

We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?

Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, disk speed). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.

4 Answers

Without first seeing the entire system, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key - a dimensional model may also be relevant here. This table can be partitioned to improve performance, which will also allow you to spread the table across other filegroups.
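To make the combined-table idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for SQL Server, so it shows only the RunID-in-the-key part and not partitioning or filegroups. The table name, column names, and helper functions are all hypothetical and not taken from the question.

```python
import sqlite3

# SQLite stands in for SQL Server here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE analysis_result (
        run_id  INTEGER NOT NULL,   -- identifies the saved analysis run
        item_id INTEGER NOT NULL,
        value   REAL    NOT NULL,
        PRIMARY KEY (run_id, item_id)  -- run_id is the leading key column
    )
    """
)


def save_run(run_id, values):
    """Store one run's results as rows instead of creating new tables."""
    conn.executemany(
        "INSERT INTO analysis_result (run_id, item_id, value) VALUES (?, ?, ?)",
        [(run_id, item_id, value) for item_id, value in enumerate(values)],
    )


def load_run(run_id):
    """Fetch the rows belonging to a single run, ordered by item."""
    cursor = conn.execute(
        "SELECT item_id, value FROM analysis_result "
        "WHERE run_id = ? ORDER BY item_id",
        (run_id,),
    )
    return cursor.fetchall()


def delete_run(run_id):
    """Cleanup becomes a DELETE rather than a series of DROP TABLEs."""
    conn.execute("DELETE FROM analysis_result WHERE run_id = ?", (run_id,))


save_run(1, [0.5, 1.5])
save_run(2, [9.0])

# However many runs are saved, the catalog still holds one table.
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
```

The point of the sketch is that saving or removing a run touches rows rather than the system catalog, so the number of tables stays constant as runs accumulate.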

Another possibility is to put each run in its own database and then detach them, only attaching them as needed (and in read-only form).

CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.

I also recommend talking to Microsoft about your choice of database design.

Do the tables all have different structures? If they have the same structure, you might get away with a single partitioned table.

If they have different structures, but are just subsets of the same set of dimension columns, you could still store them in partitions in the same table with nulls in the non-applicable columns.

If this is analytic (derivative pricing computations perhaps?) you could dump the results of a computation run to flat files and reuse your computations by loading from the flat files.
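As a rough illustration of the flat-file idea, the sketch below writes one run's results to a CSV file and reads them back. The file name, column names, and helper functions are hypothetical, and CSV is just one possible flat-file format.

```python
import csv
import os
import tempfile


def dump_run_to_file(path, header, rows):
    """Write one run's result rows to a CSV flat file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_run_from_file(path):
    """Read a saved run back; note that CSV returns every value as a string."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)


# Hypothetical location and data for a single saved run.
run_dir = tempfile.mkdtemp()
run_path = os.path.join(run_dir, "run_42.csv")

dump_run_to_file(run_path, ["item_id", "value"], [(1, 0.25), (2, 0.75)])
loaded_header, loaded_rows = load_run_from_file(run_path)
```

Saved this way, a run costs nothing in the system catalog, at the price of losing SQL access to the results until they are loaded again.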

This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)

You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that are not evident in your post. For example, how long do you plan to keep the run analysis tables? There are a LOT of other questions that need to be asked.

You are going to need a combination of serious data warehousing and data/table partitioning. Depending on how much data you want to keep and archive, you may need to start de-normalizing and flattening the tables.

This would be a pretty good case where contacting Microsoft directly can be mutually beneficial. Microsoft gets a good case to show other customers, and you get help directly from the vendor.

We ended up splitting our database into multiple databases. So the main database contains a "databases" table that refers to one or more "run" databases, each of which contains distinct sets of analysis results. Then the main "run" table contains a database ID, and the code that retrieves a saved result includes the relevant database prefix on all queries.
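The routing described above could look roughly like the following sketch. This is not our actual code: the dictionaries stand in for the "databases" and "run" tables, and the database names, the `dbo` schema default, and the function names are all assumptions made for illustration. The bracket quoting follows the usual T-SQL convention of doubling any closing bracket.

```python
def quote_name(identifier):
    """Bracket-quote a T-SQL identifier, doubling any closing bracket."""
    return "[" + identifier.replace("]", "]]") + "]"


# Hypothetical stand-ins for the main database's lookup tables.
DATABASES = {1: "RunDb01", 2: "RunDb02"}   # database ID -> database name
RUNS = {101: 1, 102: 1, 205: 2}            # run ID -> database ID


def qualified_table(run_id, table, schema="dbo"):
    """Build the three-part name for a run's table in its own database."""
    db_name = DATABASES[RUNS[run_id]]
    return ".".join(quote_name(part) for part in (db_name, schema, table))


def select_all(run_id, table):
    """Compose a query that carries the relevant database prefix."""
    return "SELECT * FROM " + qualified_table(run_id, table)


example_query = select_all(205, "Summary")
```

Keeping the prefix logic in one helper means the rest of the retrieval code does not need to know which run database a given result lives in.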

This approach keeps the system catalog of each database at a more reasonable size, it provides better separation between the core/permanent tables and the dynamic/run tables, and it also makes backups and archiving more manageable. It also allows us to split our data across multiple physical disks, although using multiple filegroups would have done that too. Overall, it's working well for us now given our current requirements, and based on expected growth we think it will scale well for us too.

We've also noticed that SQL 2008 tends to handle large system catalogs better than SQL 2000 and SQL 2005 did. (We hadn't upgraded to 2008 when I posted this question.)