
Database or flat file for 600K records?

I'm writing a C# application which needs to insert about 600K records into a database at a certain point in time.

They are very simple records: just 3 longs.

I'm using params to set up the command, and then loop through the data in memory to make the inserts, assigning the values to the command parameters on each iteration and running command.ExecuteNonQuery().

It takes about 50 seconds to finish on SQL Server, and it's even slower on MySQL, while inserting the same data into a flat file takes only a few milliseconds.

Am I doing something wrong, or is the database simply too slow?

You will see greater speed writing to a flat file for a few reasons:

  • ExecuteNonQuery does not group multiple insert statements into batches, so you are incurring a full inter-process communication round trip per record. Send your insert statements in groups.
  • The data you have are already in the shape of a flat file, so you can fire it all off in one write, or a few buffered writes.
  • Database operations tend to use trees, which take O(n log n) time, while a simple array-shaped construct takes linear time. On the other hand, if you're merging into a sorted flat file, that will take a while.

If all you need is to insert the data and never read it back, then you can write a no-op function and pretend you inserted it into /dev/null. The real question is how you plan to consume said data. Do you need to interrogate, filter, sort, or reference the individual records? I.e., why did you even consider a database to start with, if a flat file appears to be just as good?

With SQL Server you can certainly achieve better performance with a database and insert at a rate of at least 50-100k records per second. Your current choking point is probably the log flush on each insert. You must batch commits and make sure your log is on a fast array of spindles. Start a transaction, insert roughly enough records to fill a log page (64 KB), then commit. It is also worth using a battery of 5-10 SqlCommands and connections, and using async commands (BeginExecuteNonQuery with a callback) to launch multiple inserts in parallel; this way you can reclaim all the dead time you now lose in network round trips and execution context preparation.
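The batched-commit idea above can be sketched as follows. This is only a sketch: the table `dbo.Samples`, its three bigint columns, the `records` collection, and the open connection `conn` are all assumptions, and the batch size is a starting point to tune, not a prescription.

```csharp
using System.Data;
using System.Data.SqlClient;

// Commit every few thousand rows so each transaction fills roughly a log
// page, instead of paying a log flush per row. Table/column names and the
// `records` collection are hypothetical.
const int batchSize = 2000;
using (var cmd = new SqlCommand(
    "INSERT INTO dbo.Samples (A, B, C) VALUES (@a, @b, @c)", conn))
{
    cmd.Parameters.Add("@a", SqlDbType.BigInt);
    cmd.Parameters.Add("@b", SqlDbType.BigInt);
    cmd.Parameters.Add("@c", SqlDbType.BigInt);

    SqlTransaction tx = conn.BeginTransaction();
    cmd.Transaction = tx;
    int inBatch = 0;
    foreach (var r in records)
    {
        cmd.Parameters["@a"].Value = r.A;
        cmd.Parameters["@b"].Value = r.B;
        cmd.Parameters["@c"].Value = r.C;
        cmd.ExecuteNonQuery();
        if (++inBatch == batchSize)
        {
            tx.Commit();                  // one log flush per batch
            tx = conn.BeginTransaction(); // start the next batch
            cmd.Transaction = tx;
            inBatch = 0;
        }
    }
    tx.Commit(); // flush the final partial batch
}
```

Committing every few thousand rows amortizes the log flush across a whole batch instead of paying it once per row.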

So that's about 0.08 milliseconds for a single row, versus about the same for the entire file. Fair?

A database certainly has a lot more potentially going on:

  1. Parsing, validating, and executing SQL
  2. Calculating the values of any indexes
  3. Managing rollback logs if this is a single transaction
  4. Writing to its own file

I'll assume that you're running locally, so there's no network latency to include.

So I would guess that a database is slower. I wouldn't have thought 600K times slower, though.

Are you doing a bulk insert? I'd use it if you aren't already.

INSERT INTO dbo.NewTable(fields) 
SELECT fields 
FROM dbo.oldTable 
WHERE ...

In the above example you would want to ensure the tables used in the SELECT statement have the appropriate indexes, correctly assigning the clustered index to the most relevant field.

If the select statement is slow, check the execution plan to possibly find the bottleneck.

I can't help you much with MySQL. However, SQL Server 2005 and greater have some pretty intriguing XML support that might help you out. I recommend looking into updategrams, a feature that allows you to submit a batch of data to be inserted, updated, or deleted. This might help you improve the performance with SQL Server, as you only need to issue a single statement rather than 600,000 statements. I am not sure it would be quite as fast as writing to a raw file, but it should be significantly faster than issuing individual statements.

You can start learning about updategrams here: http://msdn.microsoft.com/en-us/library/aa258671(SQL.80).aspx

As Alex said: use SqlBulkCopy, nothing beats it when it comes to performance.

It is a bit tricky to use; for sample code have a look here:

http://github.com/SamSaffron/So-Slow/blob/1552b1293525bfe36f6c9b522e370de626ac6f05/Importer.cs
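For reference, a minimal SqlBulkCopy sketch looks something like this (the destination table `dbo.Samples`, its columns, the connection string, and the `records` collection are all hypothetical):

```csharp
using System.Data;
using System.Data.SqlClient;

// Stage the rows in a DataTable whose columns match the destination table,
// then stream them to the server in one bulk operation. SqlBulkCopy uses
// the same fast path as BCP/BULK INSERT, so there is no per-row round trip.
var table = new DataTable();
table.Columns.Add("A", typeof(long));
table.Columns.Add("B", typeof(long));
table.Columns.Add("C", typeof(long));
foreach (var r in records)
    table.Rows.Add(r.A, r.B, r.C);

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.Samples";
    bulk.BatchSize = 10000; // rows sent to the server per batch
    bulk.WriteToServer(table);
}
```

For 600K rows of 3 longs each, building the DataTable in memory is cheap, and the copy itself typically completes in seconds rather than minutes.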

Ayende has some interesting code to batch up exactly these ExecuteNonQuery situations. Opening Up Query Batching was the intro post where he talks about SqlCommandSet, and he then released the code in There Be Dragons: Rhino.Commons.SqlCommandSet.

If you can optimise for SQL 2008, you could also try the shiny new table-valued parameters. This sqlteam article is a good intro to them.
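A rough sketch of the table-valued parameter approach, assuming you have created a matching table type and stored procedure on the server (all names here are hypothetical):

```csharp
using System.Data;
using System.Data.SqlClient;

// Server-side prerequisites (run once), sketched in T-SQL:
//   CREATE TYPE dbo.LongTriple AS TABLE (A bigint, B bigint, C bigint);
//   CREATE PROCEDURE dbo.InsertSamples @rows dbo.LongTriple READONLY
//   AS INSERT INTO dbo.Samples (A, B, C) SELECT A, B, C FROM @rows;

// Client side: pack the rows into a DataTable and pass it as a single
// structured parameter, so all 600K rows go over in one call.
var table = new DataTable();
table.Columns.Add("A", typeof(long));
table.Columns.Add("B", typeof(long));
table.Columns.Add("C", typeof(long));
foreach (var r in records)
    table.Rows.Add(r.A, r.B, r.C);

using (var cmd = new SqlCommand("dbo.InsertSamples", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    SqlParameter p = cmd.Parameters.AddWithValue("@rows", table);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.LongTriple";
    cmd.ExecuteNonQuery();
}
```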

You are probably running the command over and over against the database server; what if you construct a command text that includes multiple inserts and then run that? i.e.

string commandText = "insert into x ( y, z) values ( 1, 2 );\r\n";
commandText += "insert into x ( y, z) values ( 2, 3 );";

command.CommandText = commandText;
command.ExecuteNonQuery();

If you do not require many concurrent users, try using MS Jet, i.e. "Microsoft Access", as your DBMS. MS Jet performance can be about 10x faster than SQL Server. BTW, inserting 600k records in just 50 seconds (12k/sec) is very fast for SQL Server.

My guess is that you're doing transactional inserts: inserts that look like this:

INSERT INTO dbo.MyTable (Field1, Field2, Field3)
VALUES (50, 100, 150)

That'll work, but as you've found, it doesn't scale. In order to push a lot of data into SQL Server very quickly, there are tools and techniques to pull it off.

Probably the simplest way to do it is with BCP. Here's a couple of links about it:

Next, you'll want to set up SQL Server in order to insert as many records as possible. Is your database in full recovery mode or simple recovery mode? To find out, go into SQL Server Management Studio, right-click on the database name, and click Properties. Full recovery mode will log every transaction, but simple recovery mode will run somewhat faster. Are the data files and log files located on separate arrays? How many drives are in each array, and what RAID type is it (1, 5, 10)? If both the data and log files are on the C drive, for example, you'll have poor performance.
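Instead of clicking through Management Studio, the recovery model can also be checked with a query against the standard `sys.databases` catalog view (SQL Server 2005 and later); the database name below is a placeholder:

```sql
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'MyDatabase';  -- substitute your database name
```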

Next, you'll want to set up your table, too. Do you have constraints and indexes on the table? Do you have other records in it already, and do you have other people querying it at the same time? If so, consider building an empty table for data loads with no indexes or constraints. Dump all the data in there as fast as possible, and then apply the constraints or indexes, or move the data into its final destination.

My SQL Server 2005 solution:

StringBuilder sb = new StringBuilder();
bool bFirst = true;

foreach(Record r in myData)
{
    if (bFirst)
        sb.AppendLine("INSERT INTO tbl (f1, f2, f3)");
    else
        sb.AppendLine("UNION ALL");
    bFirst = false;

    sb.AppendLine("SELECT " + r.data1.ToString() + "," + 
        r.data2.ToString() + "," + r.data3.ToString());
}

SqlCommand cmd = new SqlCommand(sb.ToString(), conn);
cmd.ExecuteNonQuery();

Wonder how that would perform ;)
