
Number of DB Connections vs Java Threads

I am currently developing a Java application that compares table data across two different databases.

I am using connection pooling and a thread-pool ExecutorService. I have made the number of connections and the number of threads configurable, and I am therefore trying to find the optimal value for each.

I know that the best way to find the optimal numbers is to try out different values, but my question is: what factors should I consider, and how do I calculate the number of connections/threads required?

There are typically 3000 tables to compare, and the table list/schema is available upfront. For the time being, assume that each table holds only a few hundred records (so I don't need to query a table more than once).

Currently, my application spawns one thread (from the thread pool) per table. That thread opens two DB connections, one to each database (sequentially for now), and once the data is retrieved, the same thread calls a method that compares the data.
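The current setup could be sketched roughly like this (class and method names are illustrative, not my actual code):

```java
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;
import javax.sql.DataSource;

public class CompareJob {
    // Compare the two row snapshots as-is; fetchAll orders rows so both
    // sides come back in a comparable order.
    static boolean sameData(List<?> left, List<?> right) {
        return left.equals(right);
    }

    // Read a whole table into memory (fine here, since tables are small).
    static List<List<Object>> fetchAll(Connection con, String table) throws SQLException {
        List<List<Object>> rows = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM " + table + " ORDER BY 1")) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<Object> row = new ArrayList<>(cols);
                for (int i = 1; i <= cols; i++) row.add(rs.getObject(i));
                rows.add(row);
            }
        }
        return rows;
    }

    // One task per table; each task borrows a connection to each database.
    static void compareAll(DataSource db1, DataSource db2,
                           List<String> tables, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String table : tables) {
            pool.submit(() -> {
                try (Connection c1 = db1.getConnection();
                     Connection c2 = db2.getConnection()) {
                    boolean equal = sameData(fetchAll(c1, table), fetchAll(c2, table));
                    System.out.println(table + ": " + (equal ? "match" : "DIFFERS"));
                } catch (SQLException e) {
                    System.err.println(table + ": " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```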

Here are a few questions I have. Say N is the number of cores and M is the maximum number of DB connections the databases can accept:

  1. If I have more threads than N, will that be useful for my use case? If yes, how?
  2. What is the limiting factor here: the number of cores or the number of connections?
  3. Is having more threads than M of any use?

  1. Yes, spawning far more threads than cores will help, because at any given time some of the threads will be blocked doing I/O, during which other threads can do processing.

  2. From the above it follows that the limiting factor is certainly not the number of cores. However, the number of connections may not be the limiting factor either. Of course you cannot exceed the number of connections, but you might find that you cannot even reach that limit, in the sense that disk throughput (on the database server side) or network congestion might become a problem before you get there.

  3. Having more threads than the maximum number of connections might yield some small benefit, if you make sure to a) obtain a connection from the connection pool, b) read all the data in, c) release the connection back to the pool, and THEN d) compare the data. That's because while one thread is comparing data, another thread can use that connection to do its own reading. However, comparing data sounds like a fairly simple and quick job, so the benefit will not be great: your thread will finish comparing quickly, after which it will want another connection from the pool, at which point it will block if all connections are in use.
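The read-then-release pattern in point 3 can be sketched like this. The `Supplier` arguments are a hypothetical stand-in for "borrow a connection from the pool, run the SELECT, close the connection" — the key point is that no connection is held during the comparison:

```java
import java.util.List;
import java.util.function.Supplier;

public class ReadThenCompare {
    // Each Supplier borrows a pooled connection, reads the rows, and
    // releases the connection before returning. By the time we compare,
    // both connections are back in the pool for other threads to use.
    public static boolean compareTables(Supplier<List<List<Object>>> readLeft,
                                        Supplier<List<List<Object>>> readRight) {
        List<List<Object>> left = readLeft.get();   // connection held only inside get()
        List<List<Object>> right = readRight.get(); // same for the second database
        // CPU-bound work happens here, with zero connections held
        return left.equals(right);
    }
}
```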

That having been said, I hope you are aware that there are tools out there, even free ones, that will do these kinds of comparisons for you. Search for "SQL compare". (I know, it is a misnomer: the tools do not compare SQL, they compare databases, and they happen to use SQL to query the databases they compare. I did not come up with the name; the creators of these tools did.)

The simple answer to your questions is "it depends"; i.e. there is no simple answer or magic formula.

Each database query you perform has steps that involve client-side computation, steps that require computation and disk I/O on the server, and steps that involve the transmission of the query and the results over the network. For any given query, these steps happen in a particular order. And the elapsed time to perform the query is the elapsed time taken to perform each of the steps, one after the other.

Let us assume (for the sake of argument) that the queries are independent; i.e. one query doesn't lock a resource that another one depends on.

Now, as your workload grows (depending on the queries themselves and the number of client-side threads), the individual steps of the queries will consume more and more of the relevant resources (CPU, I/O bandwidth, network). You can keep increasing the number of client-side threads, but at some point one of those resources will be used at 100% ... and you will hit a bottleneck. Once you reach that point, increasing the number of client threads won't make queries any faster. Go too far and throughput will actually start to drop, due to various resource-contention effects.

Q: Can we predict what the throughput limit will be?

A: Not without a deep analysis of the entire system and workload, which is ... not practical.

Q: Can we predict what the bottleneck will be?

A: Not without a deep analysis of the entire system and workload, which is ... not practical.

Q: Can we deduce the optimal number of client-side threads for a given number of client-side cores?

A: Not without knowing the answers to the previous two questions.


Q: So what is the practical way to deal with this conundrum of how to size the thread pool?

A: Benchmarking and tuning!

Work out what your real workload is, create a representative benchmark (or treat the workload itself as the benchmark), and run it repeatedly while adjusting the number of client-side threads up or down. At the same time, measure the actual CPU and I/O loads on the client and the database servers to try to spot where the real resource bottleneck is. These measurements may also be useful for other kinds of tuning (e.g. database and query optimization, network tuning) and for deciding whether you need more hardware, faster network interfaces, etc.
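As a sketch of that tuning loop (the thread counts and the empty task body are placeholders, not a recommendation), you could time the same batch of comparison tasks at several pool sizes and look for the knee in the curve:

```java
import java.util.concurrent.*;

public class PoolSizeBenchmark {
    // Run `taskCount` copies of `task` on a pool of `threads` threads and
    // return the wall-clock time for the whole batch, in milliseconds.
    public static long timeBatch(int threads, int taskCount, Runnable task)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < taskCount; i++) pool.submit(task);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sweep a range of pool sizes; plug a real table comparison into the task.
        for (int threads : new int[] {1, 2, 4, 8, 16, 32}) {
            long ms = timeBatch(threads, 100, () -> { /* compare one table pair here */ });
            System.out.println(threads + " threads -> " + ms + " ms");
        }
    }
}
```

While each run is in flight, watch CPU, disk, and network utilization on both ends, as described above, to see which resource saturates first.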

If you take the "benchmark and tune" approach, you don't need an accurate prediction of the number of threads.
