
Speed up LINQ inserts

I have a CSV file and I have to insert it into a SQL Server database. Is there a way to speed up the LINQ inserts?

I've created a simple Repository method to save a record:

    public void SaveOffer(Offer offer)
    {
        Offer dbOffer = this.db.Offers.SingleOrDefault(
             o => o.offer_id == offer.offer_id);

        // add new offer
        if (dbOffer == null)
        {
            this.db.Offers.InsertOnSubmit(offer);
        }
        //update existing offer
        else
        {
            dbOffer = offer;
        }

        this.db.SubmitChanges();
    }

But using this method, the program is much slower than inserting the data using ADO.NET SQL commands (new SqlConnection, one SqlCommand to select whether the row exists, one SqlCommand for the update/insert).

On 100k CSV rows it takes about an hour, versus a minute or so the ADO.NET way. For 2M CSV rows, ADO.NET took about 20 minutes; LINQ had added only about 30k of those 2M rows after 25 minutes. My database has 3 tables, linked in the dbml, but the other two tables are empty. The tests were made with all the tables empty.

PS: I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy.

Updates/Edits: After 18 hours, the LINQ version had added just ~200K rows.

I've tested the import with just LINQ inserts too, and it is also really slow compared with ADO.NET. I haven't seen a big difference between just inserts/SubmitChanges and selects/updates/inserts/SubmitChanges.

I still have to try batch commits, manually connecting to the db, and compiled queries.

SubmitChanges does not batch changes; it does a single insert statement per object. If you want to do fast inserts, I think you need to stop using LINQ.

While SubmitChanges is executing, fire up SQL Profiler and watch the SQL being executed.

See the question "Can LINQ to SQL perform batch updates and deletes? Or does it always do one row update at a time?" here: http://www.hookedonlinq.com/LINQToSQLFAQ.ashx

It links to this article: http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx , which uses extension methods to work around LINQ to SQL's inability to batch inserts, updates, etc.

Have you tried wrapping the inserts within a transaction and/or delaying db.SubmitChanges so that you can batch several inserts?

Transactions help throughput by reducing the need for fsync()s, and delaying db.SubmitChanges reduces the number of .NET <-> db round trips.
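A rough sketch of that approach, assuming the Offer entity from the question; OffersDataContext, LoadOffersFromCsv, and path are placeholders for the asker's actual generated context and CSV-parsing code:

```csharp
// Sketch only: one transaction, one SubmitChanges for the whole batch.
// Requires System.Transactions (TransactionScope) and System.Data.Linq.
using (var scope = new TransactionScope())
using (var db = new OffersDataContext())   // placeholder context type
{
    foreach (Offer offer in LoadOffersFromCsv(path))   // placeholder parser
    {
        db.Offers.InsertOnSubmit(offer);
    }

    db.SubmitChanges();   // single flush instead of one per row
    scope.Complete();     // commit everything as one transaction
}
```

Each insert is still a separate INSERT statement on the wire, but the per-row SubmitChanges overhead and per-row commit disappear.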

Edit: see http://www.sidarok.com/web/blog/content/2008/05/02/10-tips-to-improve-your-linq-to-sql-application-performance.html for some more optimization principles.

Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert instead of using LINQ's InsertOnSubmit() function.

You just need to add the (provided) BulkInsert class to your code, make a few subtle changes to your code, and you'll see a huge improvement in performance.

Mikes Knowledge Base - BulkInserts with LINQ

Good luck!

I wonder if you're suffering from an overly large set of data accumulating in the data context, making it slow to resolve rows against the internal identity cache (which is checked once during the SingleOrDefault, and, for "misses", I would expect to see a second hit when the entity is materialized).

I can't recall 100% whether the short-circuit works for SingleOrDefault (although it will in .NET 4.0).

I would try ditching the data context (submit the changes and replace the context with an empty one) every n operations, for some n (maybe 250 or so).
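A sketch of that idea (OffersDataContext and offers are placeholders for the asker's generated context and parsed CSV rows): flush, dispose, and recreate the context every n rows so the identity cache stays small:

```csharp
// Sketch: replace the DataContext every n rows so its internal
// identity cache never grows large. Names are placeholders.
const int n = 250;
var db = new OffersDataContext();
int count = 0;

foreach (Offer offer in offers)
{
    db.Offers.InsertOnSubmit(offer);
    if (++count % n == 0)
    {
        db.SubmitChanges();
        db.Dispose();
        db = new OffersDataContext();   // fresh context, empty cache
    }
}

db.SubmitChanges();                     // flush the final partial batch
db.Dispose();
```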


Given that you're calling SubmitChanges per instance at the moment, you may also be wasting a lot of time checking the delta, which is pointless if you've only changed one row. Only call SubmitChanges in batches, not per record.

Alex gave the best answer, but I think a few things are being overlooked.

One of the major bottlenecks you have here is calling SubmitChanges for each item individually. A problem I don't think most people know about is that if you haven't manually opened your DataContext's connection yourself, the DataContext will repeatedly open and close it itself. However, if you open it yourself, and close it yourself when you're absolutely finished, things will run a lot faster, since it won't have to reconnect to the database every time. I found this out when trying to work out why DataContext.ExecuteCommand() was so unbelievably slow when executing multiple commands at once.
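A sketch of the manual open/close around the question's existing method (offers is a placeholder for the parsed CSV rows):

```csharp
// Sketch: hold one connection open for the whole import instead of
// letting the DataContext open and close it for every command.
db.Connection.Open();
try
{
    foreach (Offer offer in offers)   // placeholder: parsed CSV rows
    {
        SaveOffer(offer);             // the question's existing upsert
    }
}
finally
{
    db.Connection.Close();            // close only when fully finished
}
```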

A few other areas where you could speed things up:

While LINQ to SQL doesn't support straight-up batch processing, you should wait to call SubmitChanges() until you've analyzed everything first. You don't need to call SubmitChanges() after each InsertOnSubmit call.

If live data integrity isn't super crucial, you could retrieve the list of offer_id values from the server before you start checking whether an offer already exists. This could significantly reduce the number of times you call the server to fetch an existing item that isn't even there.
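A sketch of that pre-fetch, assuming offer_id is an int and offers holds the parsed CSV rows (both assumptions):

```csharp
// Sketch: one round trip fetches every existing id; the per-row
// existence check then happens purely in memory.
var existingIds = new HashSet<int>(db.Offers.Select(o => o.offer_id));

foreach (Offer offer in offers)            // placeholder: parsed CSV rows
{
    if (!existingIds.Contains(offer.offer_id))
    {
        db.Offers.InsertOnSubmit(offer);
        existingIds.Add(offer.offer_id);   // guard against duplicate CSV rows
    }
    // rows that already exist would still need a separate update pass
}

db.SubmitChanges();
```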

Why not pass an offer[] into that method, and do all the changes in cache before submitting them to the database? Or you could use groups for submission, so you don't run out of cache. The main thing is how long until you send over the data; the biggest time-waster is in the closing and opening of the connection.

Converting this to a compiled query is the easiest way I can think of to boost your performance here:

Change the following:

    Offer dbOffer = this.db.Offers.SingleOrDefault (
         o => o.offer_id == offer.offer_id);

to:

Offer dbOffer = RetrieveOffer(this.db, offer.offer_id);

// note: DataContext here stands for your generated data-context type
private static readonly Func<DataContext, int, Offer> RetrieveOffer =
    CompiledQuery.Compile((DataContext context, int offerId) =>
        context.Offers.SingleOrDefault(o => o.offer_id == offerId));

This change alone will not make it as fast as your ADO.NET version, but it will be a significant improvement, because without the compiled query you are dynamically building the expression tree every time you run this method.

As one poster already mentioned, you must refactor your code so that SubmitChanges is called only once if you want optimal performance.

Do you really need to check whether the record exists before inserting it into the DB? It looks strange to me, as the data comes from a CSV file.

"PS: I've tried to use SqlBulkCopy, but I need to do some transformations on Offer before inserting it into the db, and I think that defeats the purpose of SqlBulkCopy."

I don't think it defeats the purpose at all; why would it? Just fill a simple DataTable with all the data from the CSV and do a SqlBulkCopy. I did a similar thing with a collection of 30,000+ rows, and the import time went from minutes to seconds.
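A sketch of that approach; the column names, connectionString, and csvRows are all placeholders, since the Offer schema isn't shown in the question:

```csharp
// Sketch: do the Offer transformations in memory first, then hand the
// whole table to SqlBulkCopy in one go.
// Requires System.Data (DataTable) and System.Data.SqlClient (SqlBulkCopy).
var table = new DataTable();
table.Columns.Add("offer_id", typeof(int));
table.Columns.Add("name", typeof(string));   // example columns only

foreach (string[] fields in csvRows)         // placeholder: parsed CSV lines
{
    // apply the per-row transformations here, before the insert
    table.Rows.Add(int.Parse(fields[0]), fields[1].Trim());
}

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "Offers";
    bulk.WriteToServer(table);               // one bulk operation
}
```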

I suspect it isn't the inserting or updating operations that are taking a long time, but rather the code that determines whether your offer already exists:

Offer dbOffer = this.db.Offers.SingleOrDefault(
         o => o.offer_id == offer.offer_id);

If you look to optimise this, I think you'll be on the right track. Perhaps use the Stopwatch class to do some timing that will help to prove me right or wrong.
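For example, the lookup inside SaveOffer could be timed on its own (a sketch using System.Diagnostics.Stopwatch, inserted into the question's method):

```csharp
// Sketch: time the existence check separately from the insert/update.
var sw = Stopwatch.StartNew();

Offer dbOffer = this.db.Offers.SingleOrDefault(
    o => o.offer_id == offer.offer_id);

sw.Stop();
Debug.WriteLine("lookup took " + sw.ElapsedMilliseconds + " ms");
```

Comparing that figure against the time spent in SubmitChanges should show where the hour is going.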

Usually, when not using LINQ to SQL, you would have an insert/update procedure or SQL script that determines whether the record you pass already exists. You're doing this expensive operation in LINQ, which can never hope to match the speed of native SQL looking up a primary key (which is what happens when you use a SqlCommand and select whether the record exists).

Well, you must understand that LINQ creates code dynamically for all the ADO operations you do, instead of them being handwritten, so it will always take more time than your manual code. It's simply an easy way to write code, but if you want to talk about performance, ADO.NET code will always be faster, depending on how you write it.

I don't know whether LINQ will try to reuse its last statement or not; if it does, then separating the insert batch from the update batch may improve performance a little bit.

This code runs OK, and prevents large amounts of data from accumulating in the change set:

if (repository2.GeoItems.GetChangeSet().Inserts.Count > 1000)
{
    repository2.GeoItems.SubmitChanges();
}

Then, at the end of the bulk insertion, use this:

repository2.GeoItems.SubmitChanges();

