Multi threading C# application with SQL Server database calls

I have a SQL Server database with 500,000 records in table main. There are also three other tables called child1, child2, and child3. The many-to-many relationships between child1, child2, child3, and main are implemented via three relationship tables: main_child1_relationship, main_child2_relationship, and main_child3_relationship. I need to read the records in main, update main, and also insert new rows into the relationship tables as well as new records into the child tables. The records in the child tables have uniqueness constraints, so the pseudo-code for the actual calculation (CalculateDetails) would be something like:

for each record in main
{
   find its child1 like qualities
   for each one of its child1 qualities
   {
      find the record in child1 that matches that quality
      if found
      {
          add a record to main_child1_relationship to connect the two records
      }
      else
      {
          create a new record in child1 for the quality mentioned
          add a record to main_child1_relationship to connect the two records
      }
   }
   ...repeat the above for child2
   ...repeat the above for child3 
}

This works fine as a single-threaded app, but it is too slow. The processing in C# is pretty heavy duty and takes too long. I want to turn this into a multi-threaded app.

What is the best way to do this? We are using Linq to Sql.

So far my approach has been to create a new DataContext object for each batch of records from main and use ThreadPool.QueueUserWorkItem to process it. However, these batches are stepping on each other's toes: one thread adds a record, then the next thread tries to add the same one, and ... I am getting all kinds of interesting SQL Server deadlocks.

Here is the code:

    int skip = 0;
    List<int> thisBatch;
    Queue<List<int>> allBatches = new Queue<List<int>>();
    do
    {
        thisBatch = allIds
                .Skip(skip)
                .Take(numberOfRecordsToPullFromDBAtATime).ToList();
        allBatches.Enqueue(thisBatch);
        skip += numberOfRecordsToPullFromDBAtATime;

    } while (thisBatch.Count() > 0);

    while (allBatches.Count() > 0)
    {
        RRDataContext rrdc = new RRDataContext();

        var currentBatch = allBatches.Dequeue();
        lock (locker)  
        {
            runningTasks++;
        }
        System.Threading.ThreadPool.QueueUserWorkItem(x =>
                    ProcessBatch(currentBatch, rrdc));

        lock (locker) 
        {
            while (runningTasks > MAX_NUMBER_OF_THREADS)
            {
                 Monitor.Wait(locker);
                 UpdateGUI();
            }
        }
    }

And here is ProcessBatch:

    private static void ProcessBatch( 
        List<int> currentBatch, RRDataContext rrdc)
    {
        var topRecords = GetTopRecords(rrdc, currentBatch);
        CalculateDetails(rrdc, topRecords);
        rrdc.Dispose();

        lock (locker)
        {
            runningTasks--;
            Monitor.Pulse(locker);
        };
    }

And

    private static List<Record> GetTopRecords(RecipeRelationshipsDataContext rrdc, 
                                              List<int> thisBatch)
    {
        List<Record> topRecords;

        topRecords = rrdc.Records
                    .Where(x => thisBatch.Contains(x.Id))
                    .OrderBy(x => x.OrderByMe).ToList();
        return topRecords;
    }

CalculateDetails is best explained by the pseudo-code at the top.

I think there must be a better way to do this. Please help. Many thanks!

Here's my take on the problem:

  • When using multiple threads to insert/update/query data in SQL Server, or any database, deadlocks are a fact of life. You have to assume they will occur and handle them appropriately.

  • That's not to say we shouldn't attempt to limit the occurrence of deadlocks. It's easy to read up on the basic causes of deadlocks and take steps to prevent them, but SQL Server will always surprise you :-)

Some reasons for deadlocks:

  • Too many threads - try to limit the number of threads to a minimum, but of course we want more threads for maximum performance.

  • Not enough indexes. If selects and updates aren't selective enough, SQL Server will take out larger range locks than is healthy. Try to specify appropriate indexes.

  • Too many indexes. Updating indexes causes deadlocks, so try to reduce indexes to the minimum required.

  • Transaction isolation level too high. The default isolation level when using a .NET TransactionScope is 'Serializable', whereas the default in SQL Server itself is 'Read Committed'. Reducing the isolation level can help a lot (if appropriate of course).
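To make the isolation-level point concrete, here is a minimal sketch of wrapping a unit of work in a TransactionScope that asks for Read Committed. The class name ReadCommittedScope, the Run method, and the 30-second timeout are my own illustrative choices, not anything from the question; whether Read Committed is safe depends on what your work actually reads and writes.

```csharp
using System;
using System.Transactions;

public static class ReadCommittedScope
{
    // Hypothetical helper: runs `work` inside a TransactionScope that uses
    // Read Committed rather than TransactionScope's default of Serializable.
    public static void Run(Action work)
    {
        var options = new TransactionOptions
        {
            IsolationLevel = IsolationLevel.ReadCommitted,
            Timeout = TimeSpan.FromSeconds(30) // assumed value; tune for your workload
        };

        using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
        {
            work();           // e.g. LINQ to SQL queries followed by SubmitChanges()
            scope.Complete(); // marks the transaction for commit; if work() throws,
                              // this line is skipped and disposal rolls back
        }
    }
}
```

Inside the delegate, Transaction.Current reports the isolation level that was requested, which is a cheap way to check that the setting is really in effect.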

This is how I might tackle your problem:

  • I wouldn't roll my own threading solution; I would use the Task Parallel Library. My main method would look something like this:

     using (var dc = new TestDataContext())
     {
         // Get all the ids of interest.
         // I assume you mark successfully updated rows in some way
         // in the update transaction.
         List<int> ids = dc.TestItems.Where(...).Select(item => item.Id).ToList();

         var problematicIds = new List<ErrorType>();

         // Either allow the Task Parallel Library to select what it considers
         // as the optimum degree of parallelism by omitting the
         // ParallelOptions parameter, or specify what you want.
         Parallel.ForEach(ids, new ParallelOptions {MaxDegreeOfParallelism = 8},
                             id => CalculateDetails(id, problematicIds));
     }

  • Execute the CalculateDetails method with retries for deadlock failures

     private static void CalculateDetails(int id, List<ErrorType> problematicIds)
     {
         try
         {
             // Handle deadlocks
             DeadlockRetryHelper.Execute(() => CalculateDetails(id));
         }
         catch (Exception e)
         {
             // Too many deadlock retries (or other exception).
             // Record so we can diagnose problem or retry later.
             // List<T> is not thread-safe and this method runs on several
             // threads at once, so guard the shared list.
             lock (problematicIds)
             {
                 problematicIds.Add(new ErrorType(id, e));
             }
         }
     }

  • The core CalculateDetails method

     private static void CalculateDetails(int id)
     {
         // Creating a new DataContext is not expensive.
         // No need to create outside of this method.
         using (var dc = new TestDataContext())
         {
             // TODO: adjust IsolationLevel to minimize deadlocks
             // If you don't need to change the isolation level
             // then you can remove the TransactionScope altogether
             using (var scope = new TransactionScope(
                 TransactionScopeOption.Required,
                 new TransactionOptions {IsolationLevel = IsolationLevel.Serializable}))
             {
                 TestItem item = dc.TestItems.Single(i => i.Id == id);

                 // work done here

                 dc.SubmitChanges();
                 scope.Complete();
             }
         }
     }

  • And of course my implementation of a deadlock retry helper

     public static class DeadlockRetryHelper
     {
         private const int MaxRetries = 4;
         private const int SqlDeadlock = 1205;

         public static void Execute(Action action, int maxRetries = MaxRetries)
         {
             if (HasAmbientTransaction())
             {
                 // Deadlock blows out containing transaction
                 // so no point retrying if already in tx.
                 action();
                 return; // without this the action would run a second time below
             }

             int retries = 0;

             while (retries < maxRetries)
             {
                 try
                 {
                     action();
                     return;
                 }
                 catch (Exception e)
                 {
                     if (IsSqlDeadlock(e))
                     {
                         retries++;
                         // Delay subsequent retries - not sure if this helps or not
                         Thread.Sleep(100 * retries);
                     }
                     else
                     {
                         throw;
                     }
                 }
             }

             // Final attempt: any exception now propagates to the caller.
             action();
         }

         private static bool HasAmbientTransaction()
         {
             return Transaction.Current != null;
         }

         private static bool IsSqlDeadlock(Exception exception)
         {
             if (exception == null)
             {
                 return false;
             }

             var sqlException = exception as SqlException;

             if (sqlException != null && sqlException.Number == SqlDeadlock)
             {
                 return true;
             }

             if (exception.InnerException != null)
             {
                 return IsSqlDeadlock(exception.InnerException);
             }

             return false;
         }
     }

  • One further possibility is to use a partitioning strategy

If your tables can naturally be partitioned into several distinct sets of data, then you can either use SQL Server partitioned tables and indexes, or you could manually split your existing tables into several sets of tables. I would recommend using SQL Server's partitioning, since the second option would be messy. Be aware, though, that built-in partitioning is only available in SQL Server Enterprise Edition (in versions before SQL Server 2016 SP1).

If partitioning is possible for you, you could choose a partition scheme that breaks your data into, let's say, 8 distinct sets. Now you could use your original single-threaded code, but have 8 threads, each targeting a separate partition. Now there won't be any (or at least a minimum number of) deadlocks.
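On the application side, the "one worker per partition" idea could look roughly like the sketch below. It is only an illustration: PartitionedRunner, Partition, and Run are names I made up, and splitting by id modulo the partition count is an assumed scheme. It only avoids contention if the child rows touched by one partition really are disjoint from those touched by the others.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class PartitionedRunner
{
    // Splits ids into `partitionCount` disjoint groups by (id mod partitionCount).
    // The modulo key is an assumption; use whatever key matches your data.
    public static List<List<int>> Partition(IEnumerable<int> ids, int partitionCount)
    {
        if (partitionCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(partitionCount));

        var partitions = Enumerable.Range(0, partitionCount)
                                   .Select(_ => new List<int>())
                                   .ToList();
        foreach (int id in ids)
        {
            // The double modulo keeps the index non-negative for negative ids.
            int index = ((id % partitionCount) + partitionCount) % partitionCount;
            partitions[index].Add(id);
        }
        return partitions;
    }

    // Starts one task per partition; inside a partition the ids are processed
    // sequentially, exactly like the original single-threaded loop.
    public static void Run(IEnumerable<int> ids, int partitionCount, Action<int> processOne)
    {
        Task[] workers = Partition(ids, partitionCount)
            .Select(partition => Task.Run(() =>
            {
                foreach (int id in partition)
                    processOne(id);
            }))
            .ToArray();

        Task.WaitAll(workers); // rethrows worker failures as an AggregateException
    }
}
```

A call such as PartitionedRunner.Run(ids, 8, id => CalculateDetails(id)) would then give 8 independent sequential loops.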

I hope that makes sense.

Overview

The root of your problem is that the L2S DataContext, like the Entity Framework's ObjectContext, is not thread-safe. As explained in this MSDN forum exchange, support for asynchronous operations in the .NET ORM solutions is still pending as of .NET 4.0; you'll have to roll your own solution, which as you've discovered isn't always easy to do when your framework assumes single-threadedness.

I'll take this opportunity to note that L2S is built on top of ADO.NET, which itself fully supports asynchronous operation - personally, I would much prefer to deal directly with that lower layer and write the SQL myself, just to make sure that I fully understood what was transpiring over the network.

SQL Server Solution?

That being said, I have to ask - must this be a C# solution? If you can compose your solution out of a set of insert/update statements, you can just send over the SQL directly and your threading and performance problems vanish.* It seems to me that your problems are related not to the actual data transformations to be made, but center around making them performant from .NET. If .NET is removed from the equation, your task becomes simpler. After all, the best solution is often the one that has you writing the smallest amount of code, right? ;)

Even if your update/insert logic can't be expressed in a strictly set-relational manner, SQL Server does have a built-in mechanism for iterating over records and performing logic - while they are justly maligned for many use cases, cursors may in fact be appropriate for your task.

If this is a task that has to happen repeatedly, you could benefit greatly from coding it as a stored procedure.

*Of course, long-running SQL brings its own problems like lock escalation and index usage that you'll have to contend with.

C# Solution

Of course, it may be that doing this in SQL is out of the question - maybe your code's decisions depend on data that comes from elsewhere, for example, or maybe your project has a strict 'no-SQL-allowed' convention. You mention some typical multithreading bugs, but without seeing your code I can't really be helpful with them specifically.

Doing this from C# is obviously viable, but you need to deal with the fact that a fixed amount of latency will exist for each and every call you make. You can mitigate the effects of network latency by using pooled connections, enabling multiple active result sets, and using the asynchronous Begin/End methods for executing your queries. Even with all of those, you will still have to accept that there is a cost to shipping data from SQL Server to your application.

One of the best ways to keep your code from stepping all over itself is to avoid sharing mutable data between threads as much as possible. That would mean not sharing the same DataContext across multiple threads. The next best approach is to lock critical sections of code that touch the shared data - lock blocks around all DataContext access, from the first read to the final write. That approach might just obviate the benefits of multithreading entirely; you can likely make your locking more fine-grained, but be ye warned that this is a path of pain.
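As one hedged illustration of what "more fine-grained" could mean here: rather than locking around every DataContext call, you could serialize only the "find or create this child row" step, per distinct quality. The ChildIdCache class and the findOrCreateInDatabase delegate below are hypothetical names of my own; the delegate stands in for whatever query-then-insert logic you already have. This only protects threads inside a single process, so the database uniqueness constraint still has to stay in place.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical helper: guarantees that, within this process, the
// find-or-create step runs at most once per distinct quality value.
public sealed class ChildIdCache
{
    // Uses the default (case-sensitive) string comparer; pass a comparer that
    // matches your column's collation if the database treats case differently.
    private readonly ConcurrentDictionary<string, Lazy<int>> _ids =
        new ConcurrentDictionary<string, Lazy<int>>();

    private readonly Func<string, int> _findOrCreateInDatabase;

    public ChildIdCache(Func<string, int> findOrCreateInDatabase)
    {
        _findOrCreateInDatabase = findOrCreateInDatabase;
    }

    public int GetOrCreate(string quality)
    {
        // GetOrAdd may build more than one Lazy under contention, but only one
        // is stored, and only the stored Lazy ever runs the database delegate.
        Lazy<int> lazy = _ids.GetOrAdd(
            quality,
            key => new Lazy<int>(
                () => _findOrCreateInDatabase(key),
                LazyThreadSafetyMode.ExecutionAndPublication));

        return lazy.Value; // other threads asking for the same key block here
    }
}
```

Threads working on different qualities never wait on each other; threads that need the same quality wait for the first one and then reuse its id. One caveat: in this mode Lazy caches an exception thrown by the delegate, so a failed creation stays failed for that key until the cache is rebuilt.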

Far better is to keep your operations separate from each other entirely. If you can partition your logic across 'main' records, that's ideal - that is to say, as long as there aren't relationships between the various child tables, and as long as one record in 'main' doesn't have implications for another, you can split your operations across multiple threads like this:

private IList<int> GetMainIds()
{
    using (var context = new MyDataContext())
        return context.Main.Select(m => m.Id).ToList();
}

private void FixUpSingleRecord(int mainRecordId)
{
    using (var localContext = new MyDataContext())
    {
        var main = localContext.Main.FirstOrDefault(m => m.Id == mainRecordId);

        if (main == null)
            return;

        foreach (var childOneQuality in main.ChildOneQualities)
        {
            // If child one is not found, create it
            // Create the relationship if needed
        }

        // Repeat for ChildTwo and ChildThree

        // LINQ to SQL name; an Entity Framework context would call SaveChanges()
        localContext.SubmitChanges();
    }
}

public void FixUpMain()
{
    var ids = GetMainIds();
    foreach (var id in ids)
    {
        var localId = id; // Avoid closing over an iteration member
        ThreadPool.QueueUserWorkItem(delegate { FixUpSingleRecord(localId); });
    }
}

Obviously this is as much a toy example as the pseudocode in your question, but hopefully it gets you thinking about how to scope your tasks such that there is no (or minimal) shared state between them. That, I think, will be the key to a correct C# solution.
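If you want the same per-record isolation but with a cap on concurrency and a way to know when everything has finished, one possible variation is sketched below. BoundedFixUp and its Run method are illustrative names, not part of the code above; the fixUpSingleRecord delegate is assumed to create and dispose its own context, as FixUpSingleRecord does.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BoundedFixUp
{
    // Runs fixUpSingleRecord once per id with at most maxDegreeOfParallelism
    // concurrent calls. Returns the ids that failed with their exceptions.
    public static IReadOnlyCollection<KeyValuePair<int, Exception>> Run(
        IEnumerable<int> ids, Action<int> fixUpSingleRecord, int maxDegreeOfParallelism)
    {
        // ConcurrentQueue is safe to write to from many threads at once.
        var failures = new ConcurrentQueue<KeyValuePair<int, Exception>>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

        // Blocks until every id has been processed.
        Parallel.ForEach(ids, options, id =>
        {
            try
            {
                fixUpSingleRecord(id); // shares no state with the other calls
            }
            catch (Exception e)
            {
                // Record the failure and keep going with the remaining ids.
                failures.Enqueue(new KeyValuePair<int, Exception>(id, e));
            }
        });

        return failures.ToArray();
    }
}
```

Something like BoundedFixUp.Run(GetMainIds(), FixUpSingleRecord, 8) would process the ids eight at a time and hand back the failed ids for a later retry.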

EDIT: Responding to updates and comments

If you're seeing data consistency issues, I'd advise enforcing transaction semantics - you can do this by using a System.Transactions.TransactionScope (add a reference to System.Transactions). Alternately, you might be able to do this on an ADO.NET level by accessing the inner connection and calling BeginTransaction on it (or whatever the DataConnection method is called).

You also mention deadlocks. That you're battling SQL Server deadlocks indicates that the actual SQL queries are stepping on each other's toes. Without knowing what is actually being sent over the wire, it's difficult to say in detail what's happening and how to fix it. Suffice it to say that SQL deadlocks result from SQL queries, and not necessarily from C# threading constructs - you need to examine what exactly is going over the wire. My gut tells me that if each 'main' record is truly independent of the others, then there shouldn't be a need for row and table locks, and that Linq to SQL is likely the culprit here.

You can get a dump of the raw SQL emitted by L2S in your code by setting the DataContext.Log property to something, e.g. Console.Out. Though I've never personally used it, I understand that LINQPad offers L2S facilities and you may be able to get at the SQL there, too.

SQL Server Management Studio will get you the rest of the way there - using the Activity Monitor, you can watch for lock escalation in real time. Using the Query Analyzer, you can get a view of exactly how SQL Server will execute your queries. With those, you should be able to get a good notion of what your code is doing server-side, and in turn how to go about fixing it.

I would recommend moving all the XML processing into SQL Server, too. Not only will all your deadlocks disappear, but you will see such a boost in performance that you will never want to go back.

It will be best explained by an example. In this example I assume that the XML blob is already going into your main table (I call it closet). I will assume the following schema:

CREATE TABLE closet (id int PRIMARY KEY, xmldoc ntext) 
CREATE TABLE shoe(id int PRIMARY KEY IDENTITY, color nvarchar(20))
CREATE TABLE closet_shoe_relationship (
    closet_id int REFERENCES closet(id),
    shoe_id int REFERENCES shoe(id)
)

And I expect that your data (main table only) initially looks like this:

INSERT INTO closet(id, xmldoc) VALUES (1, '<ROOT><shoe><color>blue</color></shoe></ROOT>')
INSERT INTO closet(id, xmldoc) VALUES (2, '<ROOT><shoe><color>red</color></shoe></ROOT>')

Then your whole task is as simple as the following:

INSERT INTO shoe(color) SELECT DISTINCT CAST(CAST(xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) AS color from closet
INSERT INTO closet_shoe_relationship(closet_id, shoe_id) SELECT closet.id, shoe.id FROM shoe JOIN closet ON CAST(CAST(closet.xmldoc AS xml).query('//shoe/color/text()') AS nvarchar) = shoe.color

But given that you will do a lot of similar processing, you can make your life easier by declaring your main blob as the XML type, which simplifies the statements further to this:

INSERT INTO shoe(color)
    SELECT DISTINCT CAST(xmldoc.query('//shoe/color/text()') AS nvarchar)
    FROM closet
INSERT INTO closet_shoe_relationship(closet_id, shoe_id)
    SELECT closet.id, shoe.id
    FROM shoe JOIN closet
        ON CAST(xmldoc.query('//shoe/color/text()') AS nvarchar) = shoe.color

There are additional performance optimizations possible, like pre-computing repeatedly invoked XPath results in a temporary or permanent table, or converting the initial population of the main table into a BULK INSERT, but I don't expect that you will really need those to succeed.

SQL Server deadlocks are normal and to be expected in this type of scenario - MS's recommendation is that these should be handled on the application side rather than the db side.

However, if you do need to make sure that a stored procedure is only run by one caller at a time, you can use a SQL mutex lock via sp_getapplock. Here's an example of how to implement this:

BEGIN TRAN
DECLARE @mutex_result int;
EXEC @mutex_result = sp_getapplock @Resource = 'CheckSetFileTransferLock',
 @LockMode = 'Exclusive';

IF ( @mutex_result < 0)
BEGIN
    -- Could not acquire the lock: undo the transaction and stop here
    ROLLBACK TRAN
    RETURN
END

-- do some stuff

EXEC @mutex_result = sp_releaseapplock @Resource = 'CheckSetFileTransferLock'
COMMIT TRAN  

This may be obvious, but looping through each tuple and doing your work in your application process (rather than in the database) involves a lot of per-record overhead.

If possible, move some or all of that processing to SQL Server by rewriting your logic as one or more stored procedures.

If

  • You don't have a lot of time to spend on this issue and need to fix it right now
  • You are sure that your code is written so that different threads will NOT modify the same record
  • You are not afraid

Then ... you can just add the WITH (NOLOCK) table hint to your queries so that SQL Server doesn't take shared locks when reading. The price is that those queries can see uncommitted data.

Use with caution :)

But anyway, you didn't tell us where the time is lost (in the single-threaded version). If it's in the code, I'd advise you to write everything in the DB directly to avoid continuous data exchange. If it's in the DB, I'd advise checking the indexes (too many?), I/O, CPU, etc.
