
Multiple threads selecting a row from a database: optimisation

Original post: 2013-07-05 12:08:38 9 3 java/ multithreading/ database-design/ data-binding

I have a Java application in which 15 threads each select a row from a table with 11,000 records through a synchronized method called getNext(). The threads are slow at selecting a row, so the whole job takes a huge amount of time. Each thread follows this process:

1. The thread checks whether a row with the resume column set to 1 exists.
A. If it exists, the thread takes the id of that row and uses it to select another row with an id greater than that id.
B. Otherwise, it selects a row with an id greater than 0.
2. The last row retrieved in step 1 above is marked by setting its resume column to 1.
3. The thread takes the row data and works on it.

Question:

1. How can multiple threads access the same table, each selecting rows that no other thread has selected, and still be fast?
2. How can the threads be made to resume, after a crash, from the last row that was selected by any of the threads?

3 answers

1.: It seems the multiple database operations in getNext() are the bottleneck. If the data isn't changed by an outside source, you could read the "id" and "resume" values of all rows once and cache them. Then you would need only one query and could serve all subsequent reads from memory. This would save a lot of expensive DB calls in getNext().
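A minimal sketch of the caching idea, assuming the ids come from one up-front query and that nothing outside the application changes the table. The class name RowIdCache and its constructor argument are hypothetical. A concurrent queue hands out each cached id at most once, so getNext() no longer needs to be synchronized or to touch the database. It only covers the read side: progress would still have to be written back to the table for crash recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hands out cached row ids to worker threads; each id is returned at most once. */
class RowIdCache {
    private final Queue<Long> pending = new ConcurrentLinkedQueue<>();

    /** idsFromSingleQuery: hypothetical result of the one up-front query, in ascending id order. */
    RowIdCache(List<Long> idsFromSingleQuery) {
        pending.addAll(idsFromSingleQuery);
    }

    /** Thread-safe without synchronized; returns null once every id has been handed out. */
    Long getNext() {
        return pending.poll();
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the real query: 11,000 ids, as in the question.
        List<Long> ids = new ArrayList<>();
        for (long id = 1; id <= 11_000; id++) {
            ids.add(id);
        }
        RowIdCache cache = new RowIdCache(ids);
        Set<Long> processed = ConcurrentHashMap.newKeySet();

        Thread[] workers = new Thread[15];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                Long id;
                while ((id = cache.getNext()) != null) {
                    processed.add(id); // stand-in for the real per-row work
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        // Every id was taken by exactly one thread, so this prints 11000.
        System.out.println(processed.size());
    }
}
```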

2.: Basically you need some sort of transaction, or at least another column that gets updated when a thread has finished processing that row. The processing and the update need to happen in a single transaction. If something goes wrong before the transaction has finished, you can roll back to the state in which the row wasn't processed.

If the threads are all on the same machine, they could use a shared data structure, instead of synchronization, to avoid working on the same unit of work. The following, however, assumes the threads are on different machines (perhaps different members of an application server cluster) and can only communicate via the database.

Remove the synchronization on the getNext() method. When setting the resume flag to 1 (step 2), do so atomically: update table set resume=1 where resume = 0, then commit. Only one thread will succeed at this, and the thread that does gets that unit of work. At the same time, set a resume time. If the time elapsed since the resume time exceeds some maximum, assume the thread working on that unit of work has crashed and set the resume flag back to 0. After the work is finished, set the resume time to null, or otherwise mark the work as done.
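One way the atomic claim could look in JDBC, as a sketch under stated assumptions: the table name work_item and the columns id, resume and resume_time are made up for illustration, and the connection is assumed to have auto-commit switched off. The update count tells each thread whether it won the row. Timestamps are taken from the application clock here, which assumes the machines' clocks roughly agree.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

/** Sketch only: work_item, id, resume and resume_time are assumed names. */
class WorkClaimer {
    // The "resume = 0" condition is what makes the claim atomic: the database
    // lets only one concurrent UPDATE change a given row from 0 to 1.
    static final String CLAIM_SQL =
        "UPDATE work_item SET resume = 1, resume_time = ? WHERE id = ? AND resume = 0";
    // Rows claimed before the cutoff are assumed to belong to a crashed thread.
    static final String RELEASE_STALE_SQL =
        "UPDATE work_item SET resume = 0, resume_time = NULL WHERE resume = 1 AND resume_time < ?";
    // resume = 1 with a NULL resume_time marks the work as done.
    static final String FINISH_SQL =
        "UPDATE work_item SET resume_time = NULL WHERE id = ?";

    /** Returns true only for the caller whose UPDATE actually changed the row. */
    static boolean tryClaim(Connection conn, long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(CLAIM_SQL)) {
            ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
            ps.setLong(2, id);
            boolean claimed = ps.executeUpdate() == 1;
            conn.commit(); // assumes conn.setAutoCommit(false) was called earlier
            return claimed;
        }
    }

    /** Makes rows claimable again if they were claimed more than maxAgeMillis ago. */
    static int releaseStale(Connection conn, long maxAgeMillis) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(RELEASE_STALE_SQL)) {
            ps.setTimestamp(1, new Timestamp(System.currentTimeMillis() - maxAgeMillis));
            int released = ps.executeUpdate();
            conn.commit();
            return released;
        }
    }

    /** Clears the resume time once the work on the row is complete. */
    static void finish(Connection conn, long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(FINISH_SQL)) {
            ps.setLong(1, id);
            ps.executeUpdate();
            conn.commit();
        }
    }
}
```

A worker would call tryClaim for a candidate id and move on to the next id whenever it returns false.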

Well, I would think of different issues here:

1. Are you keeping status in your DB? I would look for an approach where you issue a select for update that filters by inactive status (be sure to get just one row in the select) and immediately update that row to active (in the same transaction). It would be nice to know what DB you're using; I'm not sure "select for update" is always an option.
2. Process the row and, when you're finished, update it to finished status.
3. Be sure to keep a timestamp in the table to identify when you last changed the status. Make yourself a rule to decide when an active thread will be treated as lost.
4. Define other possible error scenarios (what happens if the process fails).
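The select-for-update steps above might be sketched in JDBC as follows. Everything named here is an assumption: the table work_item, the columns id, status and status_changed, and the status values. The row-limiting and locking syntax is the form MySQL and PostgreSQL accept; other databases spell it differently or may not support it, as noted in point 1. Auto-commit is assumed to be off.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Sketch only: work_item, status and status_changed are assumed names. */
class StatusClaimer {
    // Locks exactly one inactive row until the transaction ends.
    static final String PICK_SQL =
        "SELECT id FROM work_item WHERE status = 'INACTIVE' ORDER BY id LIMIT 1 FOR UPDATE";
    static final String ACTIVATE_SQL =
        "UPDATE work_item SET status = 'ACTIVE', status_changed = CURRENT_TIMESTAMP WHERE id = ?";
    static final String FINISH_SQL =
        "UPDATE work_item SET status = 'FINISHED', status_changed = CURRENT_TIMESTAMP WHERE id = ?";

    /** Claims one inactive row in a single transaction; returns its id, or null if none is left. */
    static Long claimNext(Connection conn) throws SQLException {
        try (PreparedStatement pick = conn.prepareStatement(PICK_SQL);
             ResultSet rs = pick.executeQuery()) {
            if (!rs.next()) {
                conn.rollback(); // nothing to claim; end the transaction
                return null;
            }
            long id = rs.getLong(1);
            try (PreparedStatement activate = conn.prepareStatement(ACTIVATE_SQL)) {
                activate.setLong(1, id);
                activate.executeUpdate();
            }
            conn.commit(); // releases the row lock; the row is now visibly ACTIVE
            return id;
        } catch (SQLException e) {
            conn.rollback(); // leave the row INACTIVE so another thread can take it
            throw e;
        }
    }

    /** Marks a processed row as finished and refreshes its timestamp. */
    static void markFinished(Connection conn, long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(FINISH_SQL)) {
            ps.setLong(1, id);
            ps.executeUpdate();
            conn.commit();
        }
    }
}
```

The status_changed column is what the "treated as lost" rule in point 3 would be evaluated against.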

You would also need to analyze the scenario. How many rows does your table have? How many threads call it concurrently? How many inserts occur in a given time? Depending on this, you will have to see how the DB performs.

I'm assuming your getNext() is synchronized; with what I wrote in point 1 you might get around this...