
How to find out the root cause of a container going out of memory (OOM)

I am running a batch job in my Micronaut application which fetches 500,000 records from the database, picks them up in batches of 100 items, and after processing each batch (which includes an API call for that batch) inserts the data into another SQLite table.

    // The enclosing method signature is not shown here; serialId, type, batchSize,
    // consumer (a Consumer<List<ItemEntity>>) and dataSource are assumed to be
    // method parameters or fields.
    try (Connection connection = dataSource.getConnection();
         PreparedStatement statement = connection.prepareStatement(
                 "SELECT id,item_id,type,operation FROM table "
                 + "WHERE serial_id = ? AND type = ? AND fail_reason IS NULL")) {
        statement.setString(1, serialId);
        statement.setString(2, type.name());
        try (ResultSet resultSet = statement.executeQuery()) {
            List<ItemEntity> itemEntities = new ArrayList<>(batchSize);
            int i = 0;
            while (resultSet.next()) {
                itemEntities.add(ItemEntity.builder()
                        .id(resultSet.getString("id"))
                        .itemId(resultSet.getString("item_id"))
                        .type(ItemType.valueOf(resultSet.getString("type")))
                        .operation(Operation.valueOf(resultSet.getString("operation")))
                        .build());

                i++;

                if (i == batchSize) {
                    i = 0;
                    consumer.accept(itemEntities); // process the batch: API call + insert into the other sqlite table
                    itemEntities.clear();
                }
            }

            // flush the last, partially filled batch
            if (!itemEntities.isEmpty())
                consumer.accept(itemEntities);
        }
    } catch (Exception ex) {
        log.error("error", ex);
        throw new RuntimeException("error", ex);
    }

Whenever this batch job is running, the container restarts with exit code 137. I have checked with the below JVM arguments:

-XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError -XX:HeapDumpPath=/var/data/heapdump.hprof

Since I am not getting any heap dump file after the container restarts, I am assuming it is not the Micronaut application (the JVM heap) that is causing the container to go OOM.

The container memory limit is 512m.

What are all the things I can try to debug this issue?

A technique that I find effective for debugging issues like this is one I call "isolate and strip down". The idea behind it is to separate all the factors of the problem into individual concerns and keep iterating to find out which factor weighs the most in causing the problem. It is a problem-identification exercise. Once you have listed some candidate problems, you can shift your attention to a solution mindset.

For your particular issue, I can summarize what is happening as follows (correct me if I am wrong):

  • Use a database connection to prepare a statement, bind its parameters, and execute it.
  • Get the result set. With the result set, fill the item entities and send them to a consumer in batches (of about 100). Then repeat.

From here I can extract a list of concerns:

    1. SQL-query the database and expect a result of 500k rows
    2. process one ResultSet row into an item entity
    3. pass the item entities to the consumer in batches

So isolate these concerns to debug it. While doing so, comment out or delete the other parts of the code.

For item 1, see if the resulting set of 500k rows is causing the OOM. Adjust the query to return a maximum of 10k, 20k, 30k rows. Or start with 1k; see if 100 results work. Change the query a bit, add to it bit by bit, and see when it breaks.
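
A minimal sketch of that isolation step, reusing the names from your code (dataSource, serialId, type and log are assumed to be in scope); the LIMIT value is the knob to turn, and nothing is built or consumed here on purpose:

    int cap = 1_000; // start small (1k), then raise to 10k, 20k, 30k, ...
    try (Connection connection = dataSource.getConnection();
         PreparedStatement statement = connection.prepareStatement(
                 "SELECT id,item_id,type,operation FROM table "
                 + "WHERE serial_id = ? AND type = ? AND fail_reason IS NULL LIMIT ?")) {
        statement.setString(1, serialId);
        statement.setString(2, type.name());
        statement.setInt(3, cap);
        try (ResultSet resultSet = statement.executeQuery()) {
            int rows = 0;
            while (resultSet.next()) {
                rows++; // only drain the rows; no entity building, no consumer
            }
            log.info("fetched {} rows", rows);
        }
    } catch (Exception ex) {
        throw new RuntimeException("isolation probe failed", ex);
    }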

For item 2, is building the items in a loop of 100 a problem? Can you build them all the way up to 500k? Try different amounts.
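
A sketch of that concern in isolation, with no database involved, so any memory growth is attributable to the objects themselves. ItemType.CREATE and Operation.INSERT below are placeholders for whatever enum constants actually exist in your project:

    int count = 100_000; // try 100, 10_000, 100_000, 500_000
    List<ItemEntity> itemEntities = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
        itemEntities.add(ItemEntity.builder()
                .id(String.valueOf(i))
                .itemId("item-" + i)
                .type(ItemType.CREATE)       // placeholder constant
                .operation(Operation.INSERT) // placeholder constant
                .build());
    }
    log.info("built {} entities in memory", itemEntities.size());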

For item 3, can you pass one item to the consumer without any issues? Can you send it mock items created by hand and process one batch of 100? Then raise it to a batch of 1,000 or 5,000. If that succeeds, keep rotating through sizes and see at what point it fails.
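
A sketch of feeding the real consumer hand-built batches, so the API call and the SQLite insert run without the 500k query in front of them (same placeholder enum constants as above):

    int mockBatchSize = 100; // then 1_000, 5_000, ...
    List<ItemEntity> mockBatch = new ArrayList<>(mockBatchSize);
    for (int i = 0; i < mockBatchSize; i++) {
        mockBatch.add(ItemEntity.builder()
                .id("mock-" + i)
                .itemId("mock-item-" + i)
                .type(ItemType.CREATE)
                .operation(Operation.INSERT)
                .build());
    }
    consumer.accept(mockBatch); // exercises only the API call + insert path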

Another angle is to play with the memory limits. I would do this afterwards, once you have some subset of the above working, and keep incrementing until you see it break.
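
While doing that, it can help to log how much heap the JVM actually gave itself under the current 512m limit. Exit code 137 means the process was killed with SIGKILL, which in a memory-limited container is usually the OOM killer, so memory used outside the heap (metaspace, thread stacks, native buffers such as those of the SQLite JDBC driver) counts against the limit too. A minimal probe, assuming the same slf4j logger as in your code:

    Runtime rt = Runtime.getRuntime();
    log.info("max heap = {} MiB, committed = {} MiB, free = {} MiB",
            rt.maxMemory() / (1024 * 1024),
            rt.totalMemory() / (1024 * 1024),
            rt.freeMemory() / (1024 * 1024));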

Still another route is to see whether you can use Micronaut repositories and entities to solve the same problem, first in a smaller context like 100 entities and then incrementing up to 500k. From what I see above I don't think this is a Micronaut issue, but experimenting will help rule it out with proof.
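
A minimal sketch of that route, assuming Micronaut Data JDBC with ItemEntity mapped via @MappedEntity/@Id and having a failReason property; the repository name, the derived finder, and Dialect.ANSI (there is no dedicated SQLite dialect) are assumptions, not your actual code:

    // io.micronaut.data.* imports: JdbcRepository, Dialect, Page, Pageable, PageableRepository
    @JdbcRepository(dialect = Dialect.ANSI) // assumption: ANSI in place of a SQLite dialect
    public interface ItemEntityRepository extends PageableRepository<ItemEntity, String> {
        Page<ItemEntity> findBySerialIdAndTypeAndFailReasonIsNull(
                String serialId, ItemType type, Pageable pageable);
    }

    // Caller: walk the rows one page at a time instead of streaming one big ResultSet.
    Pageable pageable = Pageable.from(0, 100);
    Page<ItemEntity> page;
    do {
        page = repository.findBySerialIdAndTypeAndFailReasonIsNull(serialId, type, pageable);
        if (!page.getContent().isEmpty()) {
            consumer.accept(page.getContent());
        }
        pageable = pageable.next();
    } while (page.getContent().size() == 100);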
