What's the fastest way to bulk insert a lot of data in SQL Server (C# client)

I am hitting some performance bottlenecks with my C# client inserting bulk data into a SQL Server 2005 database and I'm looking for ways in which to speed up the process.

I am already using the SqlClient.SqlBulkCopy (which is based on TDS) to speed up the data transfer across the wire, which helped a lot, but I'm still looking for more.

I have a simple table that looks like this:

 CREATE TABLE [BulkData](
 [ContainerId] [int] NOT NULL,
 [BinId] [smallint] NOT NULL,
 [Sequence] [smallint] NOT NULL,
 [ItemId] [int] NOT NULL,
 [Left] [smallint] NOT NULL,
 [Top] [smallint] NOT NULL,
 [Right] [smallint] NOT NULL,
 [Bottom] [smallint] NOT NULL,
 CONSTRAINT [PKBulkData] PRIMARY KEY CLUSTERED 
 (
  [ContainerId] ASC,
  [BinId] ASC,
  [Sequence] ASC
))

I'm inserting data in chunks that average about 300 rows, where ContainerId and BinId are constant in each chunk, the Sequence value is 0-n, and the values are pre-sorted based on the primary key.
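For reference, a minimal sketch of this kind of SqlBulkCopy load, assuming each chunk arrives as a pre-sorted DataTable (the class and method names here are illustrative, not from the original code):

 using System.Data;
 using System.Data.SqlClient;

 static class BulkLoader
 {
     // Loads one pre-sorted ~300-row chunk into [BulkData].
     public static void LoadChunk(string connectionString, DataTable chunk)
     {
         using (var connection = new SqlConnection(connectionString))
         {
             connection.Open();
             using (var bulkCopy = new SqlBulkCopy(connection))
             {
                 bulkCopy.DestinationTableName = "[BulkData]";
                 bulkCopy.BatchSize = 300;      // send each chunk as one batch
                 bulkCopy.WriteToServer(chunk); // streams the rows over TDS
             }
         }
     }
 }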

The %Disk time performance counter spends a lot of time at 100%, so it is clear that disk IO is the main issue, but the speeds I'm getting are several orders of magnitude below a raw file copy.

Does it help any if I:

  1. Drop the primary key while I am doing the inserting and recreate it later (see the sketch just after this list)
  2. Do inserts into a temporary table with the same schema and periodically transfer them into the main table to keep the size of the table where insertions are happening small
  3. Anything else?
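Option 1 could be done from the client along these lines (a sketch only; it reuses the constraint and column names from the schema above and assumes an open SqlConnection named connection):

 // Sketch of option 1: drop the clustered PK, bulk load, then recreate it.
 using (var cmd = new SqlCommand(
     "ALTER TABLE [BulkData] DROP CONSTRAINT [PKBulkData]", connection))
 {
     cmd.ExecuteNonQuery();
 }

 // ... run the SqlBulkCopy loads here ...

 using (var cmd = new SqlCommand(
     @"ALTER TABLE [BulkData] ADD CONSTRAINT [PKBulkData]
       PRIMARY KEY CLUSTERED ([ContainerId], [BinId], [Sequence])", connection))
 {
     cmd.ExecuteNonQuery();
 }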

-- Based on the responses I have gotten, let me clarify a little bit:

Portman: I'm using a clustered index because when the data is all imported I will need to access data sequentially in that order. I don't particularly need the index to be there while importing the data. Is there any advantage to having a nonclustered PK index while doing the inserts as opposed to dropping the constraint entirely for import?

Chopeen: The data is being generated remotely on many other machines (my SQL server can only handle about 10 currently, but I would love to be able to add more). It's not practical to run the entire process on the local machine because it would then have to process 50 times as much input data to generate the output.

Jason: I am not doing any concurrent queries against the table during the import process. I will try dropping the primary key and see if that helps.

Here's how you can disable/enable indexes in SQL Server:

 --Disable Index
 ALTER INDEX [IX_Users_UserID] ON SalesDB.Users DISABLE
 GO
 --Enable Index
 ALTER INDEX [IX_Users_UserID] ON SalesDB.Users REBUILD
 GO

Here are some resources to help you find a solution:

Some bulk loading speed comparisons

Use SqlBulkCopy to Quickly Load Data from your Client to SQL Server

Optimizing Bulk Copy Performance

Definitely look into NOCHECK and TABLOCK options:

Table Hints (Transact-SQL)

INSERT (Transact-SQL)
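If the load goes through SqlBulkCopy, roughly equivalent client-side knobs are exposed as SqlBulkCopyOptions. A minimal sketch (connectionString and chunk are assumed to exist, as elsewhere in this thread):

 // Sketch: take a bulk-update table lock for the load, analogous to the
 // TABLOCK hint. Omitting SqlBulkCopyOptions.CheckConstraints leaves
 // constraint checking off during the copy (the default), which is
 // analogous to NOCHECK.
 using (var bulkCopy = new SqlBulkCopy(connectionString,
                                       SqlBulkCopyOptions.TableLock))
 {
     bulkCopy.DestinationTableName = "[BulkData]";
     bulkCopy.WriteToServer(chunk); // chunk: a DataTable of rows to insert
 }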

You're already using SqlBulkCopy, which is a good start.

However, just using the SqlBulkCopy class does not necessarily mean that SQL will perform a bulk copy. In particular, there are a few requirements that must be met for SQL Server to perform an efficient, minimally logged bulk insert (broadly: the database should not be using the full recovery model, a table lock should be taken for the load, and the target should be a heap or an empty clustered table).

Out of curiosity, why is your index set up like that? It seems like ContainerId/BinId/Sequence is much better suited to be a nonclustered index. Is there a particular reason you wanted this index to be clustered?

My guess is that you'll see a dramatic improvement if you change that index to be nonclustered. This leaves you with two options:

  1. Change the index to nonclustered, and leave it as a heap table, without a clustered index
  2. Change the index to nonclustered, but then add a surrogate key (like "id") and make it an identity, primary key, and clustered index (sketched below)
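A hedged sketch of option 2 as DDL run from the client (the surrogate column name and index name are illustrative, not from the original schema; assumes an open SqlConnection named connection):

 // Sketch of option 2: identity surrogate as the clustered PK, with the
 // original three columns kept as a nonclustered index.
 const string ddl = @"
     ALTER TABLE [BulkData] DROP CONSTRAINT [PKBulkData];
     ALTER TABLE [BulkData] ADD [Id] INT IDENTITY(1,1) NOT NULL
         CONSTRAINT [PKBulkData] PRIMARY KEY CLUSTERED;
     CREATE NONCLUSTERED INDEX [IX_BulkData_ContainerBinSeq]
         ON [BulkData] ([ContainerId], [BinId], [Sequence]);";
 using (var cmd = new SqlCommand(ddl, connection))
 {
     cmd.ExecuteNonQuery();
 }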

Either one will speed up your inserts without noticeably slowing down your reads.

Think about it this way -- right now, you're telling SQL to do a bulk insert, but then you're asking SQL to reorder the entire table every time you add anything. With a nonclustered index, you'll add the records in whatever order they come in, and then build a separate index indicating their desired order.

Have you tried using transactions?

From what you describe, having the server committing 100% of the time to disk, it seems you are sending each row of data in an atomic SQL statement, thus forcing the server to commit (write to disk) every single row.

If you used transactions instead, the server would only commit once, at the end of the transaction.

For further help: what method are you using for inserting data into the server? Updating a DataTable using a DataAdapter, or executing each statement using a string?
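With SqlBulkCopy specifically, an external transaction can span many chunks so the server commits once at the end. A sketch (chunks stands in for however the batches are produced; connectionString is assumed as before):

 // Sketch: one explicit transaction around many bulk-copied chunks, so
 // the log is flushed once at Commit instead of once per statement.
 using (var connection = new SqlConnection(connectionString))
 {
     connection.Open();
     using (var transaction = connection.BeginTransaction())
     {
         using (var bulkCopy = new SqlBulkCopy(
             connection, SqlBulkCopyOptions.Default, transaction))
         {
             bulkCopy.DestinationTableName = "[BulkData]";
             foreach (DataTable chunk in chunks) // chunks: illustrative
                 bulkCopy.WriteToServer(chunk);
         }
         transaction.Commit();
     }
 }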

I'm not really a bright guy and I don't have a lot of experience with the SqlClient.SqlBulkCopy method, but here's my 2 cents for what it's worth. I hope it helps you and others (or at least causes people to call out my ignorance ;).

You will never match a raw file copy speed unless your database data file (mdf) is on a separate physical disk from your transaction log file (ldf). Additionally, any clustered indexes would also need to be on a separate physical disk for a fairer comparison.

Your raw copy is not logging or maintaining a sort order of select fields (columns) for indexing purposes.

I agree with Portman on creating an identity surrogate as the clustered key and changing your existing clustered index to a nonclustered index.

As far as what construct you're using on the clients... (data adapter, dataset, datatable, etc). If your disk io on the server is at 100%, I don't think your time is best spent analyzing client constructs as they appear to be faster than the server can currently handle.

If you follow Portman's links about minimal logging, I wouldn't think surrounding your bulk copies in transactions would help a lot if any, but I've been wrong many times in my life ;)

This won't necessarily help you right now, but if you figure out your current issue, this next comment might help with the next bottleneck (network throughput) - especially if it's over the Internet...

Chopeen asked an interesting question too. How did you decide on chunks of 300 records to insert? SQL Server has a default packet size (I believe it is 4096 bytes) and it would make sense to me to derive the size of your records and ensure that you are making efficient use of the packets transmitted between client and server. (Note, you can change the packet size in your client code, as opposed to the server option, which would obviously change it for all server communications - probably not a good idea.) For instance, if your record size results in 300-record batches requiring 4500 bytes, you will send 2 packets, with the second packet mostly wasted. If the batch record count was arbitrarily assigned, it might make sense to do some quick, easy math.

From what I can tell (and remember about data type sizes), you have exactly 20 bytes for each record (if int = 4 bytes and smallint = 2 bytes). If you are using 300-record batches, then you are trying to send 300 x 20 = 6,000 bytes (plus, I'm guessing, a little overhead for the connection, etc). It might be more efficient to send these in 200-record batches (200 x 20 = 4,000 + room for overhead) = 1 packet. Then again, your bottleneck still appears to be the server's disk io.
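If you do tune the batch size, the packet size can be set per client connection via the standard "Packet Size" connection-string keyword. A sketch (the server and database names are illustrative):

 // Sketch: raise or set the network packet size on this client connection
 // only, rather than changing the server-wide default.
 var builder = new SqlConnectionStringBuilder
 {
     DataSource = "myServer",        // illustrative
     InitialCatalog = "myDatabase",  // illustrative
     IntegratedSecurity = true,
     PacketSize = 4096               // bytes; tune alongside batch size
 };
 using (var connection = new SqlConnection(builder.ConnectionString))
 {
     connection.Open();
     // ... SqlBulkCopy as before ...
 }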

I realize you're comparing a raw data transfer to the SqlBulkCopy with the same hardware/configuration, but here's where I would also go if the challenge were mine:

This post probably won't help you anymore as it's rather old, but I would next ask what your disk's RAID configuration is and what speed of disk you are using. Try putting the log file on a drive that uses RAID 10, with RAID 5 (ideally 1) for your data file. This can help reduce a lot of spindle movement to different sectors on the disk and result in more time reading/writing instead of the unproductive "moving" state. If you already separate your data and log files, do you have your index on a different physical disk drive from your data file (you can only do this with nonclustered indexes)? That would allow for not only concurrently updating logging information with data inserting, but would allow index inserting (and any costly index page operations) to occur concurrently.

BCP - it's a pain to set up, but it's been around since the dawn of DBs and it's very very quick.

Unless you're inserting data in that order, the 3-part index will really slow things. Applying it later will really slow things too, but will be in a second step.

Compound keys in SQL are always quite slow; the bigger the key, the slower.

I think that it sounds like this could be done using SSIS packages. They're similar to SQL 2000's DTS packages. I've used them to successfully transform everything from plain text CSV files, from existing SQL tables, and even from XLS files with six-digit row counts spanned across multiple worksheets. You could use C# to transform the data into an importable format (CSV, XLS, etc), then have your SQL server run a scheduled SSIS job to import the data.

It's pretty easy to create an SSIS package; there's a wizard built into SQL Server's Enterprise Manager tool (labeled "Import Data" I think), and at the end of the wizard it gives you the option of saving it as an SSIS package. There's a bunch more info on Technet as well.

Still facing the issue? Try this one too.

  • Check the database configuration (memory & processor).
  • For bulky data, I would suggest at least 16 GB of memory and 16 processor cores.

Yes, your ideas will help.
Lean on option 1 if there are no reads happening while you're loading.
Lean on option 2 if your destination table is being queried during your processing.

@Andrew
Question. You're inserting in chunks of 300. What is the total amount you're inserting? SQL Server should be able to handle 300 plain old inserts very fast.
