
Batching FlowFiles coming into MergeContent

I'm using MergeContent in the following way to "batch" incoming responses from a number of ExecuteSQL processors. In the MergeContent processor, I have Minimum Number of Entries set to 1000 and Max Bin Age set to 30 seconds. I then use a Correlation Attribute Name to bin the incoming FlowFiles. This seems to be working as I expect, but my question is twofold:

A. Is this a sensible approach, or is there a better/more efficient way to do this? Maybe a combination of ListFile/GetFile/MergeContent, etc.

B. Is there a performance/scalability issue with larger values for Minimum Number of Entries?

My end goal is to merge as many of the results coming out of the ExecuteSQL commands as possible into a single file, binned by the Correlation Attribute Name.
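To make the setup concrete, the configuration described above boils down to roughly the following (the correlation attribute name is a placeholder for whatever attribute the ExecuteSQL flow sets; Merge Strategy is left at its Bin-Packing Algorithm default):

    MergeContent
        Merge Strategy             = Bin-Packing Algorithm   (default)
        Correlation Attribute Name = <your.binning.attribute>
        Minimum Number of Entries  = 1000
        Max Bin Age                = 30 sec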

Your approach seems solid. The SplitContent and MergeContent processors are designed to handle large numbers of FlowFiles (remember that FlowFile content is not actually passed around the system in heap space; it is stored in the content repository, and the FlowFile acts as a reference pointer). In many scenarios we have seen users "stack" these processors -- e.g. when reading a file with 1 million records, an initial SplitContent processor splits it into FlowFiles containing 10,000 records each, and a second one then splits those FlowFiles into individual records, rather than going from 1 million to 1 in a single operation. This improves performance and reduces the chance of OutOfMemoryExceptions.
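As a rough illustration of that "stacked" split pattern (this sketch uses SplitText and its Line Split Count property, since the record counts described map naturally onto it; SplitContent itself splits on a byte sequence rather than a record count, and the counts here are only illustrative):

    1,000,000-record file
        -> SplitText  (Line Split Count = 10000)   -> ~100 FlowFiles of 10,000 records each
        -> SplitText  (Line Split Count = 1)       -> individual single-record FlowFiles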

Similarly, you could have a second MergeContent processor aggregate the FlowFiles containing 1,000 entries each into a larger collection in a single FlowFile. The decision depends on your current throughput -- does the combination of a 30-second bin age and a 1,000-entry minimum consistently give you FlowFiles with 1,000 entries, or do the bins only collect a few hundred? You can evaluate the data provenance of the FlowFiles to determine this, and you can set up parallel flows to essentially A/B test your configurations.
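A two-stage merge along those lines might look roughly like this (the second stage's thresholds are purely illustrative; tune them against your actual throughput and verify the results via provenance):

    ExecuteSQL results
        -> MergeContent #1  (Correlation Attribute Name = <your.binning.attribute>,
                             Minimum Number of Entries  = 1000,
                             Max Bin Age                = 30 sec)
        -> MergeContent #2  (Correlation Attribute Name = <your.binning.attribute>,
                             Minimum Number of Entries  = 10,      e.g. 10 x 1,000-entry bundles
                             Max Bin Age                = 5 min)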
