
Optimizing data fetching and insertion in Spring Boot

I have 270,000 records in a CSV file with columns user_id, book_ISBN, and book_rating, and I need to insert them into a many-to-many table. I parsed the data with the openCSV library, and the result is a list.

public List<UserRatingDto> uploadRatings(MultipartFile file) throws IOException {
    // try-with-resources closes the reader once parsing is done
    try (BufferedReader fileReader = new BufferedReader(
            new InputStreamReader(file.getInputStream(), StandardCharsets.UTF_8))) {
        return new CsvToBeanBuilder<UserRatingDto>(fileReader)
                .withType(UserRatingDto.class)
                .withSeparator(';')
                .withIgnoreEmptyLine(true)
                .withSkipLines(1)
                .build()
                .parse();
    }
}
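
For reference, a DTO along these lines would match that parser configuration (the exact field types are guesses based on the code below; @CsvBindByPosition works here because the header line is skipped):

import com.opencsv.bean.CsvBindByPosition;

public class UserRatingDto {

    // Column order matches the CSV: user_id;book_ISBN;book_rating
    @CsvBindByPosition(position = 0)
    private String userId;

    @CsvBindByPosition(position = 1)
    private String bookISBN;

    @CsvBindByPosition(position = 2)
    private Integer bookRating;

    public String getUserId() { return userId; }
    public String getBookISBN() { return bookISBN; }
    public Integer getBookRating() { return bookRating; }
}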

There are no performance issues with this part; it takes approximately one minute to parse. However, to insert these records into the table, I need to fetch the corresponding books and users from the DB in order to form the relationships. I tried making the method async with the @Async annotation, I tried a parallel stream, and I tried collecting the objects and bulk-inserting with saveAll(), but it still takes far too long.

public void saveRatings(final MultipartFile file) throws IOException {
    List<UserRatingDto> userRatingDtos = uploadRatings(file);

    userRatingDtos.parallelStream().forEach(ratingDto -> {
        // One SELECT per user and per book, plus one INSERT per rating
        UserEntity user = userRepository.findByUserId(ratingDto.getUserId());
        bookRepository.findByISBN(ratingDto.getBookISBN()).ifPresent(book -> {
            BookRating bookRating = new BookRating();
            bookRating.setRating(ratingDto.getBookRating());
            bookRating.setUser(user);
            bookRating.setBook(book);
            book.getRatings().add(bookRating);
            user.getRatings().add(bookRating);
            bookRatingRepository.save(bookRating);
        });
    });
}

This is what I have now. Is there anything I can change to make it faster?

The problem is that the data is being fetched and persisted one record at a time. The most performant way to access data is usually in well-defined batches, following this pattern:

  • fetch data required for processing the batch
  • process the batch in memory
  • persist processing results before fetching the next batch

For your specific use case, you can do something like:

    public void saveRatings(final MultipartFile file) throws IOException {
        List<UserRatingDto> userRatingDtos = uploadRatings(file);

        // Split the list into batches
        getBatches(userRatingDtos, 100).forEach(this::processBatch);
    }

    private void processBatch(List<UserRatingDto> userRatingBatch) {
        
        // Retrieve all data required to process the batch
        // (toMap below is statically imported from java.util.stream.Collectors)
        Map<String, UserEntity> users = userRepository
                .findAllById(userRatingBatch.stream().map(UserRatingDto::getUserId).toList())
                .stream()
                .collect(toMap(UserEntity::getId, user -> user));
        Map<String, Book> books = bookRepository.findAllByIsbn(userRatingBatch.stream().map(UserRatingDto::getBookISBN).toList())
                .stream()
                .collect(toMap(Book::getIsbn, book -> book));

        // Process each rating in memory
        List<BookRating> ratingsToSave = userRatingBatch.stream().map(bookRatingDto -> {
            Book book = books.get(bookRatingDto.getBookISBN());
            UserEntity user = users.get(bookRatingDto.getUserId());
            // Skip ratings whose book or user is missing so the lines below can't NPE
            if (book == null || user == null) {
                return null;
            }
            BookRating bookRating = new BookRating();
            bookRating.setRating(bookRatingDto.getBookRating());
            bookRating.setUser(user);
            bookRating.setBook(book);
            book.getRatings().add(bookRating);
            user.getRatings().add(bookRating);
            return bookRating;
        }).filter(Objects::nonNull).toList();

        // Save data in batches
        bookRatingRepository.saveAll(ratingsToSave);
        bookRepository.saveAll(books.values());
        userRepository.saveAll(users.values());

    }

    public <T> List<List<T>> getBatches(List<T> collection, int batchSize) {
        // subList returns lightweight views of the source list, so no copying happens here
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < collection.size(); i += batchSize) {
            batches.add(collection.subList(i, Math.min(i + batchSize, collection.size())));
        }
        return batches;
    }

Note that all I/O should always be done in batches. A single per-record DB lookup or save left inside the inner processing loop will defeat the whole approach.

You can try different batch sizes to see what performs best: the bigger the batch, the longer transactions stay open, and bigger batches do not always yield better overall performance.
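
Note also that with JPA/Hibernate, saveAll() only turns into real JDBC batch inserts if JDBC batching is enabled; by default every entity still gets its own INSERT statement. In a Spring Boot application.properties this typically looks like the following (the value 100 is just a starting point, and Hibernate cannot batch inserts for IDENTITY-generated ids):

    # Let Hibernate group inserts/updates into JDBC batches
    spring.jpa.properties.hibernate.jdbc.batch_size=100
    spring.jpa.properties.hibernate.order_inserts=true
    spring.jpa.properties.hibernate.order_updates=true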

Also, make sure you handle errors gracefully. For example:

  • if a batch throws an error, you can split it in two, and so on, until only one failing rating remains (see the sketch after this list);
  • you can retry a failing batch with backoff if, for example, there is a DB access problem;
  • you can discard a rating if, for example, a required field is null.
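
A minimal sketch of the split-on-failure idea (assuming the processBatch method from above and an SLF4J log field; the error handling details will vary):

    // Recursively bisect a failing batch until the offending record is isolated
    private void processBatchSafely(List<UserRatingDto> batch) {
        try {
            processBatch(batch);
        } catch (RuntimeException e) {
            if (batch.size() == 1) {
                // A single rating failed: log and discard it instead of aborting the import
                log.warn("Discarding rating that could not be saved: {}", batch.get(0), e);
                return;
            }
            int mid = batch.size() / 2;
            processBatchSafely(batch.subList(0, mid));
            processBatchSafely(batch.subList(mid, batch.size()));
        }
    }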

EDIT: As per the OP's comment, this increased performance by more than 10x. Also, if ordering is not important, performance can be improved further by processing each batch in parallel.
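
For example, if batches are fully independent, the driver loop above can simply switch to a parallel stream (each batch should then run in its own transaction, so don't wrap the whole import in a single @Transactional method):

    // Process independent batches concurrently; insert ordering is no longer guaranteed
    getBatches(userRatingDtos, 100).parallelStream().forEach(this::processBatch);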

EDIT 2: As a general pattern, ideally we wouldn't have all records in memory to begin with, but would instead retrieve the data to be processed in batches as well. This would further improve performance and avoid OOM errors.

Also, this can be implemented with many different concurrency patterns, for example with dedicated threads fetching data, worker threads processing it, and another set of threads persisting the results.

The easiest pattern is to make each unit of work independent: it is given what it should process (e.g. a set of IDs to fetch from the DB), retrieves the necessary data, processes it in memory, and persists the results.
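
A sketch of that unit-of-work idea with a plain ExecutorService (the pool size and batch size are illustrative):

    ExecutorService executor = Executors.newFixedThreadPool(8);
    try {
        // Each task is an independent unit of work: it fetches what it needs,
        // processes it in memory, and persists its own results
        List<Future<?>> results = getBatches(userRatingDtos, 100).stream()
                .map(batch -> executor.submit(() -> processBatch(batch)))
                .toList();
        for (Future<?> result : results) {
            result.get(); // propagate failures from worker threads
        }
    } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException("Batch import failed", e);
    } finally {
        executor.shutdown();
    }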

Why not just use a temporary staging table like this (possibly using NOLOGGING and other optimisations, if available):

CREATE TEMPORARY TABLE load_book_rating (
  user_id BIGINT,
  book_isbn TEXT,
  rating TEXT
);

Then batch-load the CSV data into that staging table, and bulk-insert all of it into the real table like this:

INSERT INTO book_rating (user_id, book_id, book_rating)
SELECT l.user_id, b.id, l.rating
FROM load_book_rating AS l
JOIN book AS b ON l.book_isbn = b.isbn;

I may have overlooked some details of your schema, but my main point is this: you probably have to jump through all these hoops only because ISBN is a natural key that you're not using as the primary key of your BOOK table, so every insert requires a lookup.

Alternatively, use your RDBMS's native CSV import capabilities. Most of them can do this; see e.g. PostgreSQL's COPY command.
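
For example, in PostgreSQL the whole file can be loaded into the staging table with a single statement (the path is a placeholder, and the ';' delimiter matches the file format from the question; use psql's \copy variant if the file lives on the client rather than the server):

COPY load_book_rating (user_id, book_isbn, rating)
FROM '/path/to/ratings.csv'
WITH (FORMAT csv, DELIMITER ';', HEADER true);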

I'm pretty sure that a purely SQL-based approach will outperform any approach you might implement in Java.
