
Optimizing data fetching and insertion in Spring Boot

I have 270,000 records in a CSV file with columns user_id, book_ISBN, and book_rating, and I need to insert the records into a many-to-many table. I parsed the data with the openCSV library, and the result is a list.

public List<UserRatingDto> uploadRatings(MultipartFile file) throws IOException{
        BufferedReader fileReader = new BufferedReader(new
                InputStreamReader(file.getInputStream(), "UTF-8"));

        List<UserRatingDto> ratings = new CsvToBeanBuilder<UserRatingDto>(fileReader)
                .withType(UserRatingDto.class)
                .withSeparator(';')
                .withIgnoreEmptyLine(true)
                .withSkipLines(1)
                .build()
                .parse();
        return ratings;
    }

There are no performance issues with this part; parsing takes approximately 1 minute. However, in order to insert these records into the table, I need to fetch books and users from the DB to form the relationships. I tried making the method async with the @Async annotation, I tried a parallel stream, and I tried putting the objects into a stack and using saveAll() to bulk insert, but it still takes far too much time.

 public void saveRatings(final MultipartFile file) throws IOException{
        List<UserRatingDto> userRatingDtos = uploadRatings(file);

        userRatingDtos.parallelStream().forEach(bookRating->{
            UserEntity user = userRepository.findByUserId(bookRating.getUserId());
            bookRepository.findByISBN(bookRating.getBookISBN()).ifPresent(book -> {
                BookRating bookRating1 = new BookRating();
                bookRating1.setRating(bookRating.getBookRating());
                bookRating1.setUser(user);
                bookRating1.setBook(book);
                book.getRatings().add(bookRating1);
                user.getRatings().add(bookRating1);
                bookRatingRepository.save(bookRating1);
            });

        });
}

This is what I have now. Is there anything I can change to make this faster?

The problem is that the data is being fetched and persisted one record at a time. The most performant way to access data is usually in well-defined batches, following this pattern:

  • fetch the data required for processing the batch
  • process the batch in memory
  • persist the processing results before fetching the next batch

For your specific use case, you can do something like:

    public void saveRatings(final MultipartFile file) throws IOException {
        List<UserRatingDto> userRatingDtos = uploadRatings(file);

        // Split the list into batches
        getBatches(userRatingDtos, 100).forEach(this::processBatch);
    }

    private void processBatch(List<UserRatingDto> userRatingBatch) {
        
        // Retrieve all data required to process a batch
        Map<String, UserEntity> users = userRepository
                .findAllById(userRatingBatch.stream().map(UserRatingDto::getUserId).toList())
                .stream()
                .collect(toMap(UserEntity::getId, user -> user));
        Map<String, Book> books = bookRepository.findAllByIsbn(userRatingBatch.stream().map(UserRatingDto::getBookISBN).toList())
                .stream()
                .collect(toMap(Book::getIsbn, book -> book));

        // Process each rating in memory
        List<BookRating> ratingsToSave = userRatingBatch.stream().map(bookRatingDto -> {
            Book book = books.get(bookRatingDto.getBookISBN());
            if (book == null) {
                return null;
            }
            UserEntity user = users.get(bookRatingDto.getUserId());
            BookRating bookRating = new BookRating();
            bookRating.setRating(bookRatingDto.getBookRating());
            bookRating.setUser(user);
            bookRating.setBook(book);
            book.getRatings().add(bookRating);
            user.getRatings().add(bookRating);
            return bookRating;
        }).filter(Objects::nonNull).toList();

        // Save data in batches
        bookRatingRepository.saveAll(ratingsToSave);
        bookRepository.saveAll(books.values());
        userRepository.saveAll(users.values());

    }

    public <T> List<List<T>> getBatches(List<T> collection, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < collection.size(); i += batchSize) {
            batches.add(collection.subList(i, Math.min(i + batchSize, collection.size())));
        }
        return batches;
    }

Note that all I/O should always be done in batches. If you have a single DB lookup or save inside the inner processing loop, this will not work at all.

You can experiment with different batch sizes to see what performs best - the bigger the batch, the longer transactions remain open, and bigger batches don't always result in better overall performance.
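Relatedly, if the repositories are backed by Hibernate, saveAll only produces real JDBC batch statements when JDBC batching is enabled. A sketch of the usual application.properties settings (the batch size of 100 is just an example to tune; note that IDENTITY id generation disables insert batching in Hibernate):

```properties
# Group inserts/updates into JDBC batches of up to 100 statements
spring.jpa.properties.hibernate.jdbc.batch_size=100
# Order statements by entity type so they can actually be batched together
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
```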

Also, make sure you handle errors gracefully - for example:

  • if a batch throws an error, you can split it in two, and so on, until only one rating fails.
  • you can also retry a failing batch with backoff if, for example, there's a DB access problem.
  • you can discard a rating if, for example, a required field is null.
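The first bullet above can be sketched as a generic bisection helper. The Consumer is a stand-in for a real bulk save such as bookRatingRepository::saveAll, and the fake saveAll in main only simulates a batch failure; in a real JPA setup each retry would also need its own transaction:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchFallback {
    // Bisect a failing batch recursively until the offending records are isolated.
    static <T> List<T> saveWithBisection(List<T> batch, Consumer<List<T>> saveAll) {
        List<T> failed = new ArrayList<>();
        if (batch.isEmpty()) {
            return failed;
        }
        try {
            saveAll.accept(batch);
        } catch (RuntimeException e) {
            if (batch.size() == 1) {
                // A single bad record: discard it (or log it for later inspection).
                failed.add(batch.get(0));
            } else {
                int mid = batch.size() / 2;
                failed.addAll(saveWithBisection(batch.subList(0, mid), saveAll));
                failed.addAll(saveWithBisection(batch.subList(mid, batch.size()), saveAll));
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Integer> saved = new ArrayList<>();
        // A fake saveAll that rejects any batch containing a negative value.
        Consumer<List<Integer>> saveAll = b -> {
            if (b.stream().anyMatch(i -> i < 0)) {
                throw new RuntimeException("bad record in batch");
            }
            saved.addAll(b);
        };
        List<Integer> failed = saveWithBisection(List.of(1, 2, -3, 4, 5), saveAll);
        System.out.println("saved=" + saved + ", failed=" + failed);
        // saved=[1, 2, 4, 5], failed=[-3]
    }
}
```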

EDIT: As per the OP's comment, this increased performance 10x+. Also, if ordering is not important, performance can be further improved by processing the batches in parallel.

EDIT 2: As a general pattern, ideally we wouldn't have all the records in memory to begin with; instead, the data to be processed would itself be retrieved in batches. This would further improve performance and avoid OOM errors.

This can also be done with many different concurrency patterns, for example dedicated threads to fetch data, worker threads to process it, and another set of threads to persist the results.

The simplest pattern is to make each unit of work independent: each is given what it should process (e.g. a set of ids to fetch from the DB), retrieves the necessary data, processes it in memory, and persists the results.
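A minimal sketch of that pattern with a plain ExecutorService: each submitted task is an independent unit of work over one batch. The summing processBatch is only a placeholder for real fetch/process/persist logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

public class ParallelBatches {
    // Placeholder unit of work: in the real use case this would fetch the
    // users/books for the batch, build the ratings, and call saveAll.
    static int processBatch(List<Integer> batch) {
        return batch.stream().mapToInt(Integer::intValue).sum();
    }

    public static int processAll(List<Integer> records, int batchSize, int threads)
            throws Exception {
        // Split into batches (same idea as the getBatches helper above).
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            batches.add(records.subList(i, Math.min(i + batchSize, records.size())));
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit each batch as an independent task and collect the results.
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<Integer> batch : batches) {
                futures.add(pool.submit(() -> processBatch(batch)));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> records = IntStream.rangeClosed(1, 1000).boxed().toList();
        System.out.println(processAll(records, 100, 4)); // 500500
    }
}
```

Since the tasks share no mutable state, no extra synchronization is needed; each one only touches its own batch.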

Why not just use a temporary staging table like this (possibly with NOLOGGING and other optimisations, if available):

CREATE TEMPORARY TABLE load_book_rating (
  user_id BIGINT,
  book_isbn TEXT,
  rating TEXT
);

Then batch-load the CSV data into that staging table, and then bulk-insert all the data into the real table, like this:

INSERT INTO book_rating (user_id, book_id, book_rating)
SELECT l.user_id, b.id, l.rating
FROM load_book_rating AS l
JOIN book AS b ON l.book_isbn = b.isbn;

I may have overlooked some details of your schema, but my main point is this: you're probably jumping through all these hoops only because ISBN is a natural key that you're not using as the primary key of your BOOK table, which forces you to perform a lookup.

Alternatively, use your RDBMS's native CSV import capabilities. Most of them can do this; see e.g. PostgreSQL's COPY command.
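For PostgreSQL, loading the staging table could look roughly like this; the file path is a placeholder, and the delimiter and header options are assumptions based on the semicolon-separated, header-skipping parse in the question:

```sql
COPY load_book_rating (user_id, book_isbn, rating)
FROM '/path/to/ratings.csv'
WITH (FORMAT csv, DELIMITER ';', HEADER true);
```

Note that COPY ... FROM a file path runs server-side and needs the file on the DB host; from a client, psql's \copy variant reads the local file instead.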

I'm pretty sure that a purely SQL-based approach will outperform any approach you might implement in Java.
