Tutorial about using multi-threading in JDBC

Our company has a batch application that runs every day. It mostly does database-related jobs, for example importing data from a file into a database table.

There are 20+ tasks defined in that application, and each one may or may not depend on others. The application executes the tasks one by one, and the whole application runs in a single thread.

It takes 3~7 hours to finish all the tasks. I think that's too long, so maybe I can improve performance with multi-threading.

Since there are dependencies between tasks, it is not good (or at least not easy) to run the tasks in parallel, but maybe I can use multi-threading to improve performance inside a task.

For example, we have a task defined as "ImportBizData", which copies data from a data file (usually containing 1,000,000+ rows) into a database table. Is it worth using multi-threading for that?

As I know only a little about multi-threading, I hope someone can provide some tutorial links on this topic.

Multi-threading will improve your performance, but there are a few things you need to know:

  1. Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection also carries its own transaction.
  2. Upload the data in chunks and commit after each chunk to avoid accumulating huge rollback/undo tables.
  3. Cut tasks into several work units where each unit does one job.
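Points 1 and 2 could look roughly like the sketch below, using standard JDBC batching (`addBatch`/`executeBatch`) with a commit after every chunk. The table `biz_data`, its columns, and the class and method names are made up for illustration; each worker thread would call this with its own `Connection`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class ChunkedInsert {
    /**
     * Inserts the rows through one connection, sending a JDBC batch and
     * committing after every chunkSize rows. Returns the number of commits.
     * Assumption: "biz_data" and its two columns are hypothetical.
     */
    static int insertInChunks(Connection conn, List<String[]> rows, int chunkSize)
            throws SQLException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        conn.setAutoCommit(false); // we decide when to commit
        int commits = 0;
        String sql = "INSERT INTO biz_data (id, name) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == chunkSize) {
                    ps.executeBatch();
                    conn.commit(); // keeps the undo/rollback data small
                    commits++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the last, partially filled chunk
                ps.executeBatch();
                conn.commit();
                commits++;
            }
        }
        return commits;
    }
}
```

The right chunk size depends on the database and the row width, so it is worth measuring a few values rather than guessing.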

To elaborate on the last point: currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.

What you should do:

  1. One (!) thread to read the file and create "jobs" out of it. Each job should contain a small, but not too small, "unit of work". Push those into a queue.
  2. The next thread(s) wait(s) for jobs in the queue and do(es) the calculations. This can happen while the thread in step #1 waits for the slow hard disk to return the next lines of data. The result of this conversion step goes into the next queue.
  3. One or more threads to upload the data via JDBC.
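The three steps above can be sketched with two `BlockingQueue`s and a thread pool. This is only a minimal illustration under some assumptions: the "file" is an in-memory list of comma-separated lines, the "upload" just collects rows into a list where a real uploader would run a JDBC batch on its own connection, and all names (`ImportPipeline`, `run`, the sentinels) are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ImportPipeline {
    // Sentinels ("poison pills"), compared by identity, that mark the end of a queue.
    private static final List<String> END_OF_INPUT = new ArrayList<>();
    private static final List<String[]> END_OF_PARSED = new ArrayList<>();

    static List<String[]> run(List<String> lines, int chunkSize, int parserThreads)
            throws InterruptedException, ExecutionException {
        if (chunkSize <= 0 || parserThreads <= 0) {
            throw new IllegalArgumentException("chunkSize and parserThreads must be positive");
        }
        BlockingQueue<List<String>> rawQueue = new ArrayBlockingQueue<>(16);
        BlockingQueue<List<String[]>> parsedQueue = new ArrayBlockingQueue<>(16);
        List<String[]> uploaded = new ArrayList<>(); // written only by the uploader
        ExecutorService pool = Executors.newFixedThreadPool(parserThreads + 2);
        List<Future<?>> futures = new ArrayList<>();
        try {
            // Step 1: a single reader cuts the input into jobs of chunkSize lines.
            futures.add(pool.submit(() -> {
                for (int i = 0; i < lines.size(); i += chunkSize) {
                    int end = Math.min(i + chunkSize, lines.size());
                    rawQueue.put(new ArrayList<>(lines.subList(i, end)));
                }
                for (int i = 0; i < parserThreads; i++) {
                    rawQueue.put(END_OF_INPUT); // one pill per parser
                }
                return null;
            }));
            // Step 2: parsers convert raw lines into column arrays.
            for (int t = 0; t < parserThreads; t++) {
                futures.add(pool.submit(() -> {
                    for (List<String> job = rawQueue.take(); job != END_OF_INPUT;
                            job = rawQueue.take()) {
                        List<String[]> parsed = new ArrayList<>();
                        for (String line : job) {
                            parsed.add(line.split(","));
                        }
                        parsedQueue.put(parsed);
                    }
                    parsedQueue.put(END_OF_PARSED);
                    return null;
                }));
            }
            // Step 3: one uploader; a real one would send each batch via JDBC.
            futures.add(pool.submit(() -> {
                int finishedParsers = 0;
                while (finishedParsers < parserThreads) {
                    List<String[]> batch = parsedQueue.take();
                    if (batch == END_OF_PARSED) {
                        finishedParsers++;
                    } else {
                        uploaded.addAll(batch);
                    }
                }
                return null;
            }));
            for (Future<?> f : futures) {
                f.get(); // wait for every stage and surface any failure
            }
        } finally {
            pool.shutdownNow();
        }
        return uploaded;
    }
}
```

The bounded queues give back-pressure: if the uploader is the slow stage, the parsers and the reader block instead of filling memory. With more than one parser thread the rows can reach the uploader out of file order.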

The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). On top of that, inserting data into a database is a very complex task (allocating space, updating indexes, checking foreign keys).

Using different worker threads gives you lots of advantages:

  1. It's easy to test each thread separately. Since they don't share data, you need no synchronization of your own. The queues will do that for you.
  2. You can quickly change the number of threads for each step to tweak performance.

Multi-threading may help. If the lines are uncorrelated, you could start two processes, one reading the even lines and the other the odd lines, get your database connections from a connection pool (dbcp), and analyze the performance. But first I would investigate whether JDBC is the best approach at all; databases normally have optimized solutions for imports like this. These solutions may also temporarily switch off constraint checking on your table and turn it back on later, which is also great for performance. As always, it depends on your requirements.

You may also want to check out Spring Batch, which is designed for batch processing.

I had a similar task, but in my case all the tables were unrelated to each other.

STEP 1: Use SQL Loader (Oracle) to upload the data into the database (very fast), or any similar bulk-loading tool for your database.

STEP 2: Run each upload process in a different thread for unrelated tasks, and in a single thread for related tasks.

PS: You could identify the inter-related jobs in your application, put them into groups, and run each group in a different thread.
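The grouping idea could be sketched like this: each group of related tasks gets its own thread, and the tasks inside a group still run one after another. The `TaskGroups` class and the use of plain `Runnable`s as tasks are assumptions for illustration, not part of the original application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TaskGroups {
    /**
     * Runs every group on its own thread. Tasks inside one group are
     * executed sequentially, in list order, so dependencies within a
     * group are respected. Returns once all groups have finished.
     */
    static void runGroups(List<List<Runnable>> groups)
            throws InterruptedException, ExecutionException {
        if (groups.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(groups.size());
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<Runnable> group : groups) {
                futures.add(pool.submit(() -> {
                    for (Runnable task : group) {
                        task.run(); // related tasks stay in order
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // rethrows a task failure as ExecutionException
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

This only works if no task in one group depends on a task in another group, so the grouping itself is the part that needs care.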

Links to get you started:

Java Threading: follow the last example in the linked page (Example: Partitioning a large task with multiple threads).

SQL Loader can dramatically improve performance.

The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here: http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc

If multi-threading would complicate your work, you could go with asynchronous messaging. I'm not fully aware of what your needs are, so the following is based on what I can see so far.

  1. Create a file reader in Java whose purpose is to read the biz file and put messages into the JMS queue on the server. This could be plain Java with a static void main().
  2. Consume the JMS messages in message-driven beans. (You can set a limit on the number of beans created in the pool, 50 or 100 depending on the need.) If you have multiple servers, all the better: your job is now split across multiple servers.
    1. With 2 servers and 50 beans on each server, the rows of data are split asynchronously between all of them.

You do not have to deal with threads anywhere in the process. JMS is ideal because your data is handled within a transaction: if something fails before you send an ack to the server, the message is redelivered to a consumer. The load is also split between the servers without you doing anything special like multi-threading.

Also, Spring provides Spring Batch, which can help you: http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios

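As far as I know, the JDBC-ODBC Bridge uses synchronized methods to serialize all calls to ODBC, so using multiple threads won't give you any performance boost unless it speeds up your application itself.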

I am not all that familiar with JDBC, but regarding the multi-threading part of your question, keep in mind that parallel processing relies on effectively dividing your problem into pieces that are independent of one another and then, in some way, putting their outputs back together. If you don't know the underlying dependencies between tasks, you might end up with really odd errors or exceptions in your code. Even worse, it might all execute without any problems while the results are off from the true values. Multi-threading is a tricky business: fun to learn in a way (at least I think so), but a pain in the neck when things go south.
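As a small illustration of "divide into independent pieces, then combine the outputs", the sketch below sums an array in several parts. Each worker only reads its own slice and returns a partial result, so no shared state is modified; the combination happens in one place. The class name `PartitionedSum` and the summing task itself are just stand-ins for a real computation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitionedSum {
    /** Sums the values using 'parts' independent slices, one task per slice. */
    static long sum(int[] values, int parts)
            throws InterruptedException, ExecutionException {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            int chunk = (values.length + parts - 1) / parts; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int p = 0; p < parts; p++) {
                final int from = Math.min(p * chunk, values.length);
                final int to = Math.min(from + chunk, values.length);
                Callable<Long> piece = () -> {
                    long s = 0; // local to this task, never shared
                    for (int i = from; i < to; i++) {
                        s += values[i];
                    }
                    return s;
                };
                partials.add(pool.submit(piece));
            }
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get(); // combine the outputs on the calling thread
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the workers instead added to one shared `long` without synchronization, the code could still run without errors and return a wrong total, which is exactly the "results might be off" case described above.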

Here are a couple of links that might prove useful:

If you are serious about putting effort into learning multi-threading, I can recommend Brian Goetz's Java Concurrency in Practice. An amazing book, really.

Good luck
