
Apache NiFi - OutOfMemory Error: GC overhead limit exceeded on SplitText processor

I am trying to use NiFi to process large CSV files (potentially billions of records each) using HDF 1.2.我正在尝试使用 NiFi 使用 HDF 1.2 处理大型 CSV 文件(每个文件可能有数十亿条记录)。 I've implemented my flow, and everything is working fine for small files.我已经实现了我的流程,对于小文件,一切正常。

The problem is that if I try to push the file size to 100MB (1M records), I get a java.lang.OutOfMemoryError: GC overhead limit exceeded from the SplitText processor responsible for splitting the file into single records. I've searched for that error, and it basically means that the garbage collector is running for too long without reclaiming much heap space. I expect this means that too many flow files are being generated too fast.

How can I solve this? I've tried changing NiFi's configuration for the maximum heap size and other memory-related properties, but nothing seems to work.

Right now I have added an intermediate SplitText with a line count of 1K, which lets me avoid the error, but I don't see this as a solid solution: when the incoming files potentially grow much larger than that, I am afraid I will get the same behavior from the processor.

Any suggestion is welcome! Thank you.

The reason for the error is that when splitting 1M records with a line count of 1, you are creating 1M flow files, which equate to 1M Java objects. Overall, the approach of using two SplitText processors is common and avoids creating all of the objects at the same time. You could probably use an even larger split size on the first split, maybe 10K. For a billion records I am wondering if a third level would make sense: split from 1B to maybe 10M, then 10M to 10K, then 10K to 1, but I would have to experiment with it.
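To make the cascade concrete, here is a small back-of-the-envelope sketch in plain Python (not NiFi configuration) that estimates how many flow files a single execution of each SplitText level would create in the hypothetical 1B -> 10M -> 10K -> 1 layout mentioned above. The split sizes are illustrative assumptions, not tuned recommendations.

# Flow files created by ONE execution of each SplitText in a hypothetical
# 1B -> 10M -> 10K -> 1 cascade. Split sizes are illustrative assumptions.
total_records = 1_000_000_000
line_counts = [10_000_000, 10_000, 1]   # "Line Split Count" at each level

records_in = total_records
for level, line_count in enumerate(line_counts, start=1):
    # Each execution splits ONE incoming flow file, so it only has to
    # materialize records_in / line_count flow-file objects at once.
    created_per_execution = -(-records_in // line_count)  # ceil division
    print(f"level {level}: splits a {records_in:,}-record file into "
          f"{created_per_execution:,} flow files of {line_count:,} line(s)")
    records_in = line_count

# A single-level split to 1 line (what caused the OOM) would instead have
# to create all 1,000,000,000 flow files in one execution.
print(f"single level: {total_records:,} flow files in one execution")

The point of the cascade is that no single execution ever creates more than about 10K flow-file objects at once, so the heap pressure stays bounded even though the total number of records is huge.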

Some additional things to consider are increasing the default heap size from 512MB (which you may have already done) and figuring out whether you really need to split down to 1 line. It is hard to say without knowing anything else about the flow, but in a lot of cases, if you want to deliver each line somewhere, you could potentially use a processor that reads in a large delimited file and streams each line to the destination. For example, this is how PutKafka and PutSplunk work: they can take a file with 1M lines and stream each line to the destination.
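If every line ultimately goes to an external system, the streaming idea described above can be sketched outside NiFi as well. Below is a minimal, hypothetical Python illustration: read a large delimited file and push each line to a destination one at a time, so memory use stays roughly constant regardless of file size. The send_line function and the file name are placeholders, not a real NiFi or library API.

# Minimal sketch of the streaming idea: process a huge delimited file
# line by line instead of materializing one object per record up front.

def send_line(line: str) -> None:
    # Hypothetical stand-in for the real delivery call
    # (e.g. a Kafka producer, Splunk HEC, HTTP endpoint, ...).
    pass

def stream_file(path: str) -> int:
    sent = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:                      # the file is never fully loaded into memory
            send_line(line.rstrip("\n"))
            sent += 1
    return sent

if __name__ == "__main__":
    # "big.csv" is a placeholder path for the large input file.
    print(stream_file("big.csv"), "lines streamed")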

I had a similar error while using the GetMongo processor in Apache NiFi. I changed my configuration to:

Limit: 100
Batch Size: 10

Then the error disappeared.
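For readers who want to see what those two settings do conceptually, here is a rough analogy in plain Python using pymongo. This is an assumption on my part, not the GetMongo processor's code: a query limit caps the total documents returned, while the batch size controls how many documents are pulled from the server per round trip, which bounds the memory used per fetch. Connection string, database, and collection names are placeholders.

# Rough pymongo analogy (assumed, not the GetMongo processor itself):
# "Limit" caps the total documents, "Batch Size" caps how many documents
# are fetched from the server per round trip, bounding memory per fetch.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
collection = client["mydb"]["mycollection"]         # placeholder db/collection names

cursor = collection.find({}).limit(100).batch_size(10)
for doc in cursor:
    print(doc["_id"])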
