
Options to use multithreading to process a group of database records?

I have a database table that contains some records to be processed. The table has a flag column that represents the following status values: 1 - ready to be processed, 2 - successfully processed, 3 - processing failed.

The .NET code (a repeating process - console app or service) will grab a list of records that are ready to be processed, loop through them, attempt to process each one (not very lengthy), and update the status based on success or failure.

To get better performance, I want to enable multithreading for this process. I'm thinking of spawning, say, 6 threads, each grabbing a subset of the records.

Obviously I want to avoid having different threads process the same records. I don't want a "being processed" flag in the database, because a thread crash could leave a record stuck in that state.

The only way I see to do this is to grab the complete list of available records and assign a group (maybe by IDs) to each thread. If an individual thread fails, its unprocessed records will be picked up the next time the process runs.

Are there any alternatives to dividing the records into groups before assigning them to threads?

The most straightforward way to implement this requirement is to use the Task Parallel Library's Parallel.ForEach (or Parallel.For), and let it manage the individual worker threads.
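A minimal sketch of that approach, using an in-memory dictionary as a hypothetical stand-in for the database table (the real per-record work and DB updates would replace the marked lines):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the table rows: id -> status
// (1 = ready, 2 = successfully processed, 3 = processing failed).
var records = new ConcurrentDictionary<int, int>(
    Enumerable.Range(1, 100).ToDictionary(id => id, _ => 1));

// Let the TPL partition the ready records across worker threads.
Parallel.ForEach(
    records.Where(r => r.Value == 1).Select(r => r.Key).ToList(),
    new ParallelOptions { MaxDegreeOfParallelism = 6 },
    id =>
    {
        try
        {
            // ... real per-record processing would go here ...
            records[id] = 2; // mark success
        }
        catch
        {
            records[id] = 3; // mark failure
        }
    });

Console.WriteLine(records.Values.Count(v => v == 2)); // 100
```

MaxDegreeOfParallelism caps the concurrency at the 6 threads mentioned in the question; without it, the TPL picks its own degree of parallelism.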

From experience, I would recommend the following:

  • Have an additional status, "Processing".
  • Have a column in the database that indicates when a record was picked up for processing, plus a cleanup task/process that runs periodically looking for records that have been "Processing" for far too long (and resets their status to "Ready").
  • Even though you don't want it, "being processed" is essential for crash-recovery scenarios (unless you can tolerate the same record being processed twice).
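The cleanup pass from the second bullet can be sketched in memory like this; the timeout value, status names, and record shape are all illustrative assumptions, and in practice this would be a single UPDATE statement against the table:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Each record carries a status and the time it was picked up.
// Records stuck in "Processing" past the timeout are assumed to belong
// to a crashed worker and are reset to "Ready".
var timeout = TimeSpan.FromMinutes(30);
var now = new DateTime(2024, 1, 1, 12, 0, 0);

var records = new List<(int Id, string Status, DateTime? StartedAt)>
{
    (1, "Processing", now.AddMinutes(-45)), // stale: worker likely crashed
    (2, "Processing", now.AddMinutes(-5)),  // still in flight, leave alone
    (3, "Ready", null),
};

var cleaned = records
    .Select(r => r.Status == "Processing" && now - r.StartedAt > timeout
        ? (r.Id, "Ready", (DateTime?)null)
        : r)
    .ToList();

Console.WriteLine(cleaned.Count(r => r.Item2 == "Ready")); // 2
```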

Alternatively

Consider using a transactional queue (MSMQ or RabbitMQ come to mind). They are optimized for this very problem.

That would be my clear choice, having done both at massive scale.

Optimizing

If it takes a non-trivial amount of time to retrieve data from the database, consider a producer/consumer pattern, which is quite straightforward to implement with a BlockingCollection. That pattern lets one thread (the producer) populate a queue with DB records to be processed, while multiple other threads (the consumers) process items off that queue.
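A minimal sketch of that pattern, with record IDs standing in for the rows the producer would fetch from the database:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// One producer fills the queue; several consumers drain it.
// GetConsumingEnumerable blocks until items arrive, and exits cleanly
// once the producer calls CompleteAdding and the queue is empty.
var queue = new BlockingCollection<int>(boundedCapacity: 10);
int processed = 0;

var producer = Task.Run(() =>
{
    for (int id = 1; id <= 50; id++)
        queue.Add(id);          // blocks if the queue is full
    queue.CompleteAdding();     // signals consumers that no more items come
});

var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (var id in queue.GetConsumingEnumerable())
        Interlocked.Increment(ref processed); // per-record work goes here
})).ToArray();

Task.WaitAll(consumers);
producer.Wait();
Console.WriteLine(processed); // 50
```

The bounded capacity of 10 is an illustrative choice; it keeps the producer from racing ahead of the consumers and holding too many unprocessed rows in memory.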

A New Alternative

Given that several processing steps touch each record before it is considered complete, have a look at Windows Workflow Foundation as a possible alternative.

I remember doing something like what you described... A thread checks from time to time whether there is something new in the database that needs to be processed. It loads only the new IDs, so if at time x the last ID read is 1000, at x+1 it will read from ID 1001 onwards.

Everything it reads goes into a thread-safe queue. When items are added to this queue, you notify the worker threads (maybe using auto-reset events, or by spawning threads here). Each thread reads one item at a time from this thread-safe queue until the queue is emptied.

You should not divide the work up front for each thread (unless you know that processing takes the same amount of time for every item). If a thread finishes its work, it should take over the load left by the others; using a thread-safe queue ensures this.
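The shared-queue idea above can be sketched as follows; the IDs and worker count are illustrative, and the real per-record processing would replace the marked line:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// All workers pull from one shared thread-safe queue, so a fast worker
// naturally takes on more items instead of idling - no up-front split
// of the work is needed.
var queue = new ConcurrentQueue<int>(Enumerable.Range(1001, 40)); // "new" ids
var done = new ConcurrentBag<int>();

var workers = Enumerable.Range(0, 6).Select(_ => Task.Run(() =>
{
    while (queue.TryDequeue(out int id))
        done.Add(id); // per-record processing goes here
})).ToArray();

Task.WaitAll(workers);
Console.WriteLine(done.Count); // 40
```

TryDequeue is atomic, so no two workers can ever dequeue the same ID, which addresses the question's concern about duplicate processing without any database flag.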

Here is one approach that does not rely on an additional database column (but see #4) or mandate an in-process queue. The premise of this approach is to "shard" records across workers based on some consistent value, much like a distributed cache.

Here are my assumptions:

  1. Re-processing does not cause unwanted side-effects; at most some work "is wasted".
  2. The number of threads is fixed at start-up. This is not a requirement, but it does simplify the implementation and allows me to skip transitory details in the simple description below.
  3. There is only one "worker process" (but see #1) controlling the "worker threads". This simplifies dealing with how the records are split between workers.
  4. There is some [immutable] "ID" column which is "well distributed". This is required so that each worker gets about the same amount of work.
  5. Work can be done "out of order" as long as it is "eventually done". Also, workers might not always run "at 100%", since each one is effectively working off a different queue.

Assign each thread a unique bucket value from [0, thread_count). If a thread dies or is restarted, its replacement takes the same bucket it vacated.

Then, each time a thread needs a new record, it will fetch from the database:

SELECT *
FROM record
WHERE state = 'unprocessed'
AND (id % $thread_count) = $bucket
ORDER BY date

There could of course be other refinements, such as reading "this thread's tasks" in a batch and storing them locally. A local queue, however, would be per thread (and thus re-loaded on a new thread's startup), so it would only ever deal with records belonging to the given bucket.
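The bucketing arithmetic behind that query can be sketched in C# - the ID range and thread count are illustrative:

```csharp
using System;
using System.Linq;

// With thread_count buckets, id % thread_count assigns every record to
// exactly one worker, so the buckets are disjoint and together cover
// all records - no coordination between workers is needed.
int threadCount = 6;
var ids = Enumerable.Range(1, 100);

var buckets = ids.GroupBy(id => id % threadCount)
                 .ToDictionary(g => g.Key, g => g.ToList());

Console.WriteLine(buckets.Count);                    // 6
Console.WriteLine(buckets.Values.Sum(b => b.Count)); // 100
```

This is also why assumption #4 (a well-distributed ID column) matters: if IDs cluster on certain remainders, some buckets - and therefore some workers - get far more work than others.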

When the thread has finished processing a record, it should mark the record as processed, using the appropriate isolation level and/or optimistic concurrency, and proceed to the next record.
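The optimistic-concurrency idea can be illustrated in memory with a compare-and-swap: the update succeeds only if the status is still what the worker expects, so two workers racing on the same record cannot both "win". (In SQL this would be an UPDATE with the expected status in the WHERE clause, checking the affected row count.)

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Status values as in the question: 1 = ready, 2 = processed.
const int Unprocessed = 1, Processed = 2;
var status = Enumerable.Repeat(Unprocessed, 20).ToArray();
int wins = 0;

// Two competing workers each try to claim every record.
Parallel.For(0, 2, _ =>
{
    for (int i = 0; i < status.Length; i++)
    {
        // The swap succeeds only if the slot still holds Unprocessed.
        if (Interlocked.CompareExchange(ref status[i], Processed, Unprocessed) == Unprocessed)
            Interlocked.Increment(ref wins);
    }
});

Console.WriteLine(wins); // 20 - each record is claimed exactly once
```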
