
Parallel inserts into database using Java threads

I have a Java program that needs to insert a large number of largish rows into a SQL Server database. The number of rows is 800k, and the size of each is around 200 bytes.

Currently they are divided into batches of 50, and each batch is inserted using a single statement. (We've confirmed via jTDS logging that a single sp_exec call is used for each batch.) Tuning the batch size anywhere between 25 and 250 has no significant effect; 50 is approximately the optimum.

I've experimented with dividing the batches into (say) 5 groups, and processing each group in parallel using threads. This is significantly faster -- more than twice as fast with 5 threads.

My question is about making the thread usage robust. In particular, if any of the batches fails, an exception will be thrown. I want that exception to be caught and passed up to the caller, and I want to be 100% sure that the other threads have finished (either aborted or completed) before it is passed up, because when recovering from the exception later in the program we don't want unexpected rows to keep arriving in the table.

Here's what I've done:

/** Method to insert a single batch. */
private void insertBatchPostings(Collection<Posting> postings) throws PostingUpdateException
{
    // insert the batch using a single INSERT invocation
    // throw a PostingUpdateException if anything goes wrong
}

private static final int insertionThreads = 5;

/** Method to insert a collection of batches in parallel, using the above. */
protected void insertBatchPostingsThreaded(Collection<Collection<Posting>> batches) throws PostingUpdateException
{
    ExecutorService pool = Executors.newFixedThreadPool(insertionThreads);
    Collection<Future<?>> futures = new ArrayList<Future<?>>(batches.size());

    for (final Collection<Posting> batch : batches) {
        Callable<Void> c = new Callable<Void>() {
            public Void call() throws PostingUpdateException {
                insertBatchPostings(batch);
                return null;
            }
        };
        /* So we submit each batch to the pool, and keep a note of its Future so we can check it later. */
        futures.add(pool.submit(c));
    }

    /* Pool is running, indicate that no further work will be submitted to it. */
    pool.shutdown();

    /* Check all the futures for problems. */
    for (Future<?> f : futures) {
        try {
            f.get();
        } catch (InterruptedException ex) {
            throw new PostingUpdateException("Interrupted while processing insert results: " + ex.getMessage(), ex);
        } catch (ExecutionException ex) {
            pool.shutdownNow();
            throw (PostingUpdateException) ex.getCause();
        }
    }
}

By the time this returns I want to guarantee that all the threads are dormant.
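For reference, the standard JDK idiom I believe gives that guarantee is to follow shutdownNow() with awaitTermination(), which blocks until every worker has actually finished. A minimal standalone sketch (the sleeping tasks here are placeholders for the real inserts, not my actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DrainPool {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        for (int i = 0; i < 5; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(100); // stand-in for a slow batch insert
                } catch (InterruptedException e) {
                    // We were cancelled via shutdownNow(); stop promptly.
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Stop accepting work and interrupt anything still running...
        pool.shutdownNow();
        // ...then block until every worker thread is actually dormant.
        boolean quiet = pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("terminated=" + quiet);
    }
}
```

Note that shutdownNow() alone is not enough: it only *requests* cancellation and returns immediately, so without the awaitTermination() call the workers may still be running when the method returns.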

Questions

(I'm trying to clarify what exactly I'm asking.)

  1. Is the above code completely robust, in that no batch insertions will continue to run after insertBatchPostingsThreaded returns?
  2. Are there better and simpler ways of using the Java concurrency features to achieve this? My code looks ridiculously overcomplicated to me (raising the suspicion of missed edge cases).
  3. What is the best way to get it to fail as soon as any one thread fails?

I'm not a natural Java programmer so I'm hoping to end up with something that doesn't advertise that fact. :)

Guava's Futures.successfulAsList takes a list of futures as input and returns a future "whose value is a list containing the values of all its successful input futures." You could call get() on that generated Future, and then walk through your original list of futures to check for any failures.
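If you'd rather not add a Guava dependency, a plain-JDK way to get the fail-fast behaviour of your question 3 is ExecutorCompletionService, which hands back futures in completion order, so the first failure surfaces as soon as it happens rather than after slower batches finish. A sketch with placeholder tasks standing in for insertBatchPostings:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FailFast {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<Void> ecs = new ExecutorCompletionService<>(pool);

        List<Callable<Void>> tasks = new ArrayList<>();
        tasks.add(() -> null);                                        // succeeds
        tasks.add(() -> { throw new RuntimeException("batch 2"); });  // fails
        tasks.add(() -> null);                                        // succeeds
        for (Callable<Void> t : tasks) {
            ecs.submit(t);
        }
        pool.shutdown();

        String firstError = null;
        for (int i = 0; i < tasks.size(); i++) {
            try {
                // take() returns futures in completion order, so a failed
                // batch is seen as soon as it fails.
                ecs.take().get();
            } catch (ExecutionException ex) {
                firstError = ex.getCause().getMessage();
                pool.shutdownNow();  // cancel the remaining batches...
                break;
            }
        }
        // ...and wait until every worker is dormant before returning.
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("firstError=" + firstError);
    }
}
```

In your real code you would throw the PostingUpdateException after awaitTermination() returns, which preserves the guarantee that no insert is still running when the caller sees the exception.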
