
Batching Flowfiles coming into MergeContent

I'm using MergeContent to "batch" incoming responses from a number of ExecuteSQL processors. In the MergeContent processor, I have Minimum Number of Entries set to 1000 and Max Bin Age set to 30 seconds. I also set a Correlation Attribute Name that bins the incoming FlowFiles. This seems to be working as I expect, but my question is twofold:

A. Is this a sensible approach, or is there a better/more efficient way to do this? Maybe a combination of ListFile/GetFile/MergeContent, etc.

B. Is there a performance/scalability issue with larger values of Minimum Number of Entries?

My end goal is to merge as many of the results coming out of the ExecuteSQL processors as possible into a single file, binned by Correlation Attribute Name.

Your approach seems solid. The SplitContent and MergeContent processors are designed to handle large numbers of flowfiles (remember that flowfile content is not actually passed around the system in heap space; it is stored in the content repository, and the flowfile acts as a reference pointer). In many scenarios, we have seen users "stack" these processors. For example, when reading a file with 1 million records, an initial SplitContent processor splits it into flowfiles each containing 10,000 records, and a second one splits those flowfiles into individual records, instead of going from one 1-million-record file to 1 million single-record flowfiles in a single operation. This improves performance and reduces the chance of an OutOfMemoryError.
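The stacking idea can be illustrated outside NiFi with a minimal Python sketch. This is not NiFi code; `split`, `two_stage_split`, and `children_per_step` are hypothetical helper names used only to show why a two-stage split keeps the number of children produced by any single step small.

```python
from typing import Iterator, List, Tuple


def split(records: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def two_stage_split(records: List[str], first_stage_size: int) -> Iterator[str]:
    """Split into coarse chunks first, then split each chunk into single records."""
    for chunk in split(records, first_stage_size):
        # Only one coarse chunk is expanded into single records at a time.
        for single in split(chunk, 1):
            yield single[0]


def children_per_step(total: int, first_stage_size: int) -> Tuple[int, int]:
    """Return the number of children one parent produces in stage 1 and in stage 2."""
    first = (total + first_stage_size - 1) // first_stage_size  # ceiling division
    second = min(first_stage_size, total)
    return first, second
```

With 1,000,000 records and a first stage of 10,000, `children_per_step(1_000_000, 10_000)` gives 100 children in the first step and at most 10,000 per parent in the second, rather than 1,000,000 children from a single parent.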

Similarly, you could add a second MergeContent processor to aggregate the flowfiles containing 1,000 entries each into a larger collection in a single flowfile. The decision depends on your current throughput: does the combination of 30-second binning and a 1,000-entry minimum consistently produce flowfiles with 1,000 entries, or do they only contain a few hundred? You can examine the data provenance of the flowfiles to determine this, and you can set up parallel flows to essentially A/B test your configurations.
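To reason about whether bins fill up or age out, a simplified model of correlation-based binning can help. The sketch below is an assumption-laden approximation, not NiFi's actual implementation: the `Binner` class and its method names are hypothetical, timestamps are passed in explicitly, and settings such as Maximum Number of Entries and the group size limits are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Merged = Tuple[str, List[str], str]  # (correlation key, entries, reason)


@dataclass
class Bin:
    created_at: float
    entries: List[str] = field(default_factory=list)


class Binner:
    """Toy model: one bin per correlation key, flushed on count or age."""

    def __init__(self, min_entries: int, max_bin_age: float) -> None:
        self.min_entries = min_entries
        self.max_bin_age = max_bin_age
        self.bins: Dict[str, Bin] = {}

    def expire(self, now: float) -> List[Merged]:
        """Flush every bin whose age has reached max_bin_age."""
        merged: List[Merged] = []
        for key in list(self.bins):
            if now - self.bins[key].created_at >= self.max_bin_age:
                merged.append((key, self.bins.pop(key).entries, "max_bin_age"))
        return merged

    def offer(self, key: str, item: str, now: float) -> List[Merged]:
        """Add one item to its bin and return any bins that were flushed."""
        merged = self.expire(now)
        current = self.bins.setdefault(key, Bin(created_at=now))
        current.entries.append(item)
        if len(current.entries) >= self.min_entries:
            merged.append((key, self.bins.pop(key).entries, "min_entries"))
        return merged
```

Feeding such a model with arrival times that resemble your real traffic, and counting how many flushes carry the reason `"max_bin_age"` versus `"min_entries"`, gives a rough idea of whether the 30-second age limit or the 1,000-entry minimum is the one usually closing your bins.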
