
Managing large volumes of data - stored procedures or datasets or other...?

Original post: 2008-09-16 10:28:44, tags: c#, database

I have an application that imports large volumes of data daily, several hundred thousand records.
Data comes from different sources. The data is read using C#, then bulk inserted into the database.

This data is then processed:

different tables are linked
new tables are generated
data is corrected using complicated algorithms (the totals of certain tables must sum to zero)

Most of this processing is done in stored procedures.
Although some of the complex processing would be simpler in C#, extracting the data into a dataset and reinjecting it would slow things down considerably.
You may ask why I do not process the data before inserting it into the database, but I do not think it is practical to manipulate hundreds of thousands of records in memory, and SQL's set-based commands help when creating lots of records.
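To make the "set-based" point concrete, here is a minimal sketch of the load-then-correct pattern. It is not the poster's actual schema or algorithm: the `staging` table, the `batch_id` and `amount` columns, and the "add one balancing row per batch" rule are all assumptions made for illustration, and Python's built-in `sqlite3` stands in for C# and SQL Server so the example runs anywhere.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (batch_id INTEGER, amount INTEGER)")

# "Bulk" load: executemany stands in for a real bulk-insert API.
# Amounts are integers (e.g. cents) to avoid floating-point rounding.
rows = [(1, 100), (1, -100), (2, 250), (2, -200), (3, 75)]
conn.executemany("INSERT INTO staging (batch_id, amount) VALUES (?, ?)", rows)

# Set-based correction: a single statement adds one balancing row
# to every batch whose total is not zero. No rows are pulled into
# application memory and no per-row loop is needed.
conn.execute("""
    INSERT INTO staging (batch_id, amount)
    SELECT batch_id, -SUM(amount)
    FROM staging
    GROUP BY batch_id
    HAVING SUM(amount) <> 0
""")

# Collect the per-batch totals after the correction.
totals = dict(
    conn.execute("SELECT batch_id, SUM(amount) FROM staging GROUP BY batch_id")
)
```

The correction is one statement no matter how many rows were loaded, which is the main argument for keeping this kind of work in the database instead of round-tripping through a dataset.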

This will probably spark the age-old question of using stored procedures and their pros and cons (e.g. how do you unit test stored procedures?).

What I would like in response is your experience with large volumes of data and how you tackled the problem.

3 answers

I would use SSIS or DTS (assuming you are talking about MSSQL). They are made for that purpose and work with SPs if you need them.

Another option is to preprocess the data using Perl. Even though it sounds like a weird suggestion, Perl is actually extremely fast in these scenarios. I've used it in the past to process billions of records in reasonable time (i.e. days instead of weeks).

Regarding "How do you unit test stored procedures", you unit test them with MBUnit like anything else. One bit of advice: the setup and rollback of the data can be tricky; you can use either a DTS transaction or explicit SQL statements.
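The setup-and-rollback idea can be sketched roughly as follows. This is an illustration under assumptions, not MBUnit or DTS: it uses Python's `unittest` and `sqlite3`, and because SQLite has no stored procedures, the hypothetical `balance_batches` function (one SQL statement) stands in for the procedure under test. The table and column names are made up.

```python
import sqlite3
import unittest


def balance_batches(conn):
    """Stand-in for a stored procedure (hypothetical): adds one
    balancing row to every batch whose total is not zero."""
    conn.execute("""
        INSERT INTO ledger (batch_id, amount)
        SELECT batch_id, -SUM(amount) FROM ledger
        GROUP BY batch_id HAVING SUM(amount) <> 0
    """)


class BalanceBatchesTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # isolation_level=None puts the connection in autocommit mode,
        # so the explicit BEGIN / ROLLBACK below control the transaction.
        cls.conn = sqlite3.connect(":memory:", isolation_level=None)
        cls.conn.execute(
            "CREATE TABLE ledger (id INTEGER PRIMARY KEY, batch_id INTEGER, amount INTEGER)"
        )

    def setUp(self):
        # Each test gets its own transaction and its own seed data.
        self.conn.execute("BEGIN")
        self.conn.executemany(
            "INSERT INTO ledger (id, batch_id, amount) VALUES (?, ?, ?)",
            [(1, 1, 100), (2, 1, -100), (3, 2, 40)],
        )

    def tearDown(self):
        # Undo everything the test did, including the seed data.
        self.conn.execute("ROLLBACK")

    def test_unbalanced_batch_gets_balancing_row(self):
        balance_batches(self.conn)
        totals = dict(
            self.conn.execute("SELECT batch_id, SUM(amount) FROM ledger GROUP BY batch_id")
        )
        self.assertEqual(totals, {1: 0, 2: 0})

    def test_balanced_batch_is_left_alone(self):
        balance_batches(self.conn)
        count = self.conn.execute(
            "SELECT COUNT(*) FROM ledger WHERE batch_id = 1"
        ).fetchone()[0]
        self.assertEqual(count, 2)


# Run the suite programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BalanceBatchesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the seed rows use fixed primary keys, a missed rollback would make the next test's setup fail with a key conflict, so the tests also check that the cleanup works.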

I would generally have to agree with Skliwz when it comes to doing things in MSSQL. SSIS and DTS are the way to go, but if you are unfamiliar with those technologies they can be cumbersome to work with. However, there is an alternative that would allow you to do the processing in C# and still keep your data inside SQL Server.

If you really think the processing would be simpler in C#, then you may want to look into using a SQL Server Project to create database objects using C#. There are a lot of really powerful things you can do with CLR objects inside SQL Server, and this would allow you to write and unit test the code before it ever touches the database. You can unit test your CLR code inside VS using any of the standard unit testing frameworks (NUnit, MSTest), and you don't have to write a bunch of setup and teardown scripts that can be difficult to manage.

As far as testing your stored procedures goes, I would honestly look into DBFit for that. Your database doesn't have to be a black hole of untested functionality any more :)

Where you process data depends greatly on what you're doing. If you need, for example, to discard data which you don't want in your database, then you would process that in your C# code. However, the data you process in the database should generally be "implementation agnostic". So if someone else wants to insert data from a Java client, the database should be able to reject bad data. If you put that logic into your C# code, the Java code won't know about it.

Some people object and say "but I'll never use another language for the database!" Even if that's true, you'll still have DBAs or developers working with the database, and they'll make mistakes if the logic isn't there. Or your new C# developer will try to shove in data and not know about (or just ignore) data pre-processors written in C#.

In short, the logic you put in your database should be enough to guarantee that the data is correct without relying on external software.
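As a rough illustration of logic that lives in the database and so applies to every client, here is a sketch using declarative constraints. The `payments` table, its columns, and the specific rules are invented for the example, and SQLite (via Python's `sqlite3`) stands in for whatever database server is actually in use.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id           INTEGER PRIMARY KEY,
        currency     TEXT    NOT NULL CHECK (length(currency) = 3),
        amount_cents INTEGER NOT NULL CHECK (amount_cents > 0)
    )
""")


def try_insert(currency, amount_cents):
    """Return True if the database accepted the row, False if a
    constraint rejected it. Any client, in any language, would hit
    the same rules because they are enforced by the database itself."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (currency, amount_cents) VALUES (?, ?)",
                (currency, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("EUR", 1250)        # valid row
rejected_negative = try_insert("EUR", -5)  # violates the CHECK on amount_cents
rejected_null = try_insert(None, 100)      # violates NOT NULL on currency
```

Rules that cannot be expressed as constraints (such as cross-table totals) would need triggers or stored procedures instead, but the principle is the same: the check sits next to the data rather than in one particular client.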