
One reader thread, one writer thread, n worker threads

Posted 2010-12-08 22:28:22 | Tags: java, sql, concurrency, large-data-volumes

I am trying to develop a piece of Java code that processes large amounts of data fetched from a SQL database through a JDBC driver and then persists the results back to the DB.

I thought of creating a manager containing one reader thread, one writer thread and a configurable number of worker threads that process the data. The reader thread would read data into DTOs and pass them to a queue labeled 'ready for processing'. Worker threads would process the DTOs and put the processed objects on another queue labeled 'ready for persistence'. The writer thread would persist the data back to the DB. Is such an approach optimal? Or should I perhaps allow more readers for fetching data? Are there any ready-made Java libraries for this sort of thing that I am not aware of?

5 Answers

Whether your proposed approach is optimal depends crucially on how expensive it is to process the data relative to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism, which may or may not be significant to the overall throughput).

The only way to be sure is to benchmark the three stages separately, and then decide on the optimal design.

Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue, so that a fast reader cannot fill up memory while the workers or the writer lag behind.
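To make the two-queue design with size limits concrete, here is a minimal sketch, not a definitive implementation. The names `PipelineSketch` and `Dto` are made up for illustration, the "processing" is just an upper-casing of a string, and the writer collects into a list where real code would issue batched JDBC inserts. Shutdown uses the common "poison pill" technique: the reader sends one sentinel per worker, and each worker forwards one to the writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one reader, n workers, one writer, joined by two bounded queues.
final class PipelineSketch {
    /** Minimal stand-in for a row fetched from the database (assumption). */
    static final class Dto {
        final int id;
        final String value;
        Dto(int id, String value) { this.id = id; this.value = value; }
    }

    // Sentinel marking the end of the stream; compared by identity, never processed.
    private static final Dto POISON = new Dto(-1, null);

    static List<Dto> run(List<Dto> input, int workerCount, int queueCapacity) {
        // Bounded queues: put() blocks when a queue is full, which throttles fast producers.
        BlockingQueue<Dto> readyForProcessing = new ArrayBlockingQueue<>(queueCapacity);
        BlockingQueue<Dto> readyForPersistence = new ArrayBlockingQueue<>(queueCapacity);
        // Written only by the writer thread and read only after writer.join().
        List<Dto> persisted = new ArrayList<>();

        Thread reader = new Thread(() -> {
            try {
                for (Dto dto : input) readyForProcessing.put(dto);
                // One pill per worker so that every worker terminates.
                for (int i = 0; i < workerCount; i++) readyForProcessing.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            workers.add(new Thread(() -> {
                try {
                    for (Dto dto = readyForProcessing.take(); dto != POISON;
                         dto = readyForProcessing.take()) {
                        // Placeholder for the real, expensive processing step.
                        readyForPersistence.put(new Dto(dto.id, dto.value.toUpperCase()));
                    }
                    readyForPersistence.put(POISON); // tell the writer this worker is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        Thread writer = new Thread(() -> {
            try {
                int finishedWorkers = 0;
                while (finishedWorkers < workerCount) {
                    Dto dto = readyForPersistence.take();
                    if (dto == POISON) finishedWorkers++;
                    else persisted.add(dto); // real code would batch inserts here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        workers.forEach(Thread::start);
        writer.start();
        try {
            reader.join();
            for (Thread w : workers) w.join();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return persisted;
    }

    public static void main(String[] args) {
        List<Dto> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) rows.add(new Dto(i, "row" + i));
        System.out.println(run(rows, 4, 100).size() + " rows persisted");
    }
}
```

Note that with more than one worker the output order is not the input order. If the writer needs ordering, the DTOs would have to carry a sequence number.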

I hear echoes from my past and I'd like to offer a different approach, just in case you are about to repeat my mistake. It may or may not be applicable to your situation.

You wrote that you need to fetch a large amount of data out of the database, and then persist it back to the database.

Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:

It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing (which outperforms procedural, row-by-row processing)
If your database supports it, you can make use of parallel execution
It gives you a framework (tables and SQL) for reporting on any errors you encounter during the process.

To give an example: a long time ago I implemented a (Java) program whose purpose was to load purchases, payments and related customer data from files into a central database. At that time (and I regret it deeply), I designed the load to process the transactions one by one and, for each piece of data, perform several database lookups (SQL) and finally a number of inserts into the appropriate tables. Naturally this did not scale once the volume increased.

Then I made another mistake. I decided that the database was the problem (because I had heard that SELECT is slow), so I decided to pull all the data out of the database and do ALL the processing in Java, and then finally persist all the data back to the database. I implemented all kinds of layers with callback mechanisms to make the load process easy to extend, but I just couldn't get it to perform well.

Looking in the rearview mirror, what I should have done was to insert the (laughably small amount of) 100,000 rows temporarily into a table, and process them from there. What took nearly half a day to process would have taken a few minutes at most if I had played to the strengths of all the technologies I had at my disposal.
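As a rough illustration of the staging-table idea, the sketch below bulk-inserts the incoming rows with a JDBC batch and then lets a single set-based `INSERT ... SELECT` do the lookup and the load inside the database. Everything schema-related here is an assumption made up for the example: the tables `staging_purchase`, `customer` and `purchase`, their columns, and the `String[]` row format. It only uses standard `java.sql` calls, but it needs a real `Connection` to run, and the exact SQL may need adjusting for your database.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Hypothetical sketch: stage the raw rows, then process them with one set-based statement.
final class StagingLoadSketch {
    // Assumed schema: staging_purchase(customer_ref, amount),
    // customer(id, external_ref), purchase(customer_id, amount).
    static final String INSERT_STAGING =
        "INSERT INTO staging_purchase (customer_ref, amount) VALUES (?, ?)";

    // One statement replaces the per-row "look up customer, then insert" loop.
    static final String SET_BASED_LOAD =
        "INSERT INTO purchase (customer_id, amount) "
        + "SELECT c.id, s.amount "
        + "FROM staging_purchase s "
        + "JOIN customer c ON c.external_ref = s.customer_ref";

    /** Returns the number of rows inserted into the (assumed) purchase table. */
    static int load(Connection conn, List<String[]> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_STAGING)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);                  // customer reference from the file
                ps.setBigDecimal(2, new BigDecimal(row[1])); // amount from the file
                ps.addBatch();
            }
            ps.executeBatch(); // send the staged rows in batches, not one round trip each
        }
        try (Statement st = conn.createStatement()) {
            return st.executeUpdate(SET_BASED_LOAD);
        }
    }
}
```

Rows in the staging table that find no matching customer are simply not loaded by the join, and they stay queryable afterwards, which is what makes error reporting with plain SQL possible.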

An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manage the pool of threads.
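Something along these lines, as a minimal sketch: `ExecutorSketch` and the `process` step (a plain upper-casing) are placeholders I made up, and the rows are strings rather than real DTOs. Each row is submitted as a task to a fixed-size pool, and the results are collected from the returned `Future`s in submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: let an ExecutorService manage the worker threads.
final class ExecutorSketch {
    // Placeholder for the real processing of one row (assumption).
    static String process(String row) {
        return row.toUpperCase();
    }

    static List<String> processAll(List<String> rows, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per row; the pool queues tasks until a thread is free.
            List<Future<String>> futures = new ArrayList<>();
            for (String row : rows) {
                futures.add(pool.submit(() -> process(row)));
            }
            // Collect results in submission order; get() blocks until a task is done.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a processing task failed", e.getCause());
        } finally {
            pool.shutdown(); // stop accepting tasks and let the threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(List.of("a", "b", "c"), 2));
    }
}
```

One caveat: the pool created by `Executors.newFixedThreadPool` uses an unbounded task queue, so for very large inputs you would still want to limit how many tasks are submitted ahead of the workers.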

You're describing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck using it for operations similar to what you're describing. It provides parallel and multithreaded processing, several different database readers/writers, and a whole bunch of other stuff.