
MergeContent with nifi - inconsistent length

I am attempting to write a file to disk with the MergeContent processor, but I'm getting widely varying file sizes, anywhere from one line to 806 lines. I've repeated the process many times while trying to figure out the newline demarcator, as addressed in "Apache NiFi MergeContent processor - set demarcator as new line", and I keep getting randomly sized files.

What parameters do I need to set to adhere to the following logic?

  1. Establish a single bin
  2. Route all flowfiles into the bin
  3. If len(bin) > X or the age of the bin is greater than Max Bin Age, release the bin
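The three steps above can be sketched as a small standalone model. This is only an illustration of the intended logic, not NiFi code: the class `SingleBin`, its methods, and the explicit `now` timestamps are hypothetical names chosen for the example, and it releases at `>= X` rather than `> X` so that a full bin holds exactly X entries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SingleBin:
    """Hypothetical model of the desired behaviour; not part of NiFi."""
    max_entries: int          # X in the list above
    max_age_seconds: float    # stands in for Max Bin Age
    entries: List[str] = field(default_factory=list)
    opened_at: Optional[float] = None

    def offer(self, flowfile: str, now: float) -> Optional[List[str]]:
        """Add one flowfile; return the released contents, or None."""
        if self.opened_at is None:
            self.opened_at = now  # the bin's age starts with its first entry
        self.entries.append(flowfile)
        return self._release_if_due(now)

    def tick(self, now: float) -> Optional[List[str]]:
        """Check the age condition when no new flowfile has arrived."""
        return self._release_if_due(now)

    def _release_if_due(self, now: float) -> Optional[List[str]]:
        if not self.entries:
            return None
        full = len(self.entries) >= self.max_entries  # assumption: >= instead of >
        expired = (now - self.opened_at) > self.max_age_seconds
        if full or expired:
            released, self.entries, self.opened_at = self.entries, [], None
            return released
        return None
```

Timestamps are passed in explicitly so the behaviour is easy to check by hand: a bin with `max_entries=3` releases on the third `offer`, and a partially filled bin is released by the first `tick` that arrives after `max_age_seconds` have passed since its first entry.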

To fully document, I currently have the following properties defined:

[Image: MergeContent processor settings]

As you can see, I've set "Max Bin Age" to "10 sec", following the syntax in https://github.com/apache/nifi/blob/31fba6b3332978ca2f6a1d693f6053d719fb9daa/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestMergeContent.java#L219 (this is the only place I've managed to find an example of this value; the documentation seems incomplete on this parameter).

I've set "Maximum Number of Entries" to 5000 and "Maximum number of Bins" to 1.

What do I need to do to aggregate my records following the logic above? I also tried using the "Correlation Attribute Name" parameter with an attribute guaranteed to be identical on all documents reaching this point, and saw the same behavior.

The most important thing here is actually the Minimum Number of Entries. The binning algorithm takes a lenient approach to the number of items: a bin can be merged as soon as it reaches the minimum, so with a low minimum, bins are released well before they reach the maximum.

For your specific logic, you would want to leave things as they stand and:

  • Set Minimum Number of Entries to 5000
  • Optionally, increase the Maximum Number of Entries. Leaving it as configured will generate bins of exactly 5000 entries, except in those periods where the Max Bin Age has elapsed before the bin fills

Below is an image of the configuration above, where the minimum and maximum bin sizes are both 5000 and only 1 bin is handled at a time. In this case you'll see that exactly 20000 files have been merged into 4.

[Image: Example execution with min and max bin size of 5000]
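To see why the minimum matters more than the maximum, here is a rough simulation of the behaviour this answer describes. It is a simplified, hypothetical model and not NiFi's actual binning code: the function `simulate_merge`, the random burst sizes, and the `max_burst` and `seed` parameters are all assumptions made for illustration, and Max Bin Age is ignored.

```python
import random
from typing import List


def simulate_merge(total: int, min_entries: int, max_entries: int,
                   max_burst: int = 800, seed: int = 0) -> List[int]:
    """Return the bundle sizes produced by a simplified binning model.

    Each outer loop iteration stands for one processor trigger: a random
    burst of flowfiles joins the queue, then bundles are released for as
    long as the queue holds at least min_entries flowfiles.
    """
    if min_entries < 1:
        raise ValueError("min_entries must be at least 1")
    rng = random.Random(seed)
    queued = 0
    remaining = total
    bundles: List[int] = []
    while remaining > 0:
        burst = min(remaining, rng.randint(1, max_burst))
        remaining -= burst
        queued += burst
        while queued >= min_entries:
            size = min(queued, max_entries)  # never exceed the maximum
            bundles.append(size)
            queued -= size
    if queued:
        # In NiFi the leftovers would wait for Max Bin Age; this model
        # simply releases them at the end.
        bundles.append(queued)
    return bundles


if __name__ == "__main__":
    print(simulate_merge(20000, min_entries=1, max_entries=5000)[:10])
    print(simulate_merge(20000, min_entries=5000, max_entries=5000))
```

In this model, a minimum of 1 releases whatever happens to be queued on each trigger, so bundle sizes follow the arrival bursts. With both minimum and maximum at 5000, the 20000 flowfiles come out as four bundles of 5000 regardless of how the bursts fall.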

In case anyone is having this exact issue, the cause may be not setting the schedule on the MergeContent processor. After a lot of troubleshooting, I realized that this is one of those processors where "0 sec" is not an appropriate schedule. I had already set both Min Entries and Max Entries to some high number, and Max Bin Age was set to 5 min. It was the schedule that was causing the processor to keep grabbing flowfiles and bundling them up in random sizes.
