
Batch processing flowfiles in Apache NiFi

I have written a custom NiFi processor that tries to batch-process input flow files.

However, it does not seem to behave as expected. Here is what is happening:

I copy some files onto the server. FetchFromServerProcessor fetches those files from the server and puts them in queue1. MyCustomProcessor reads files in batches from queue1. I have a batchSize property defined on MyCustomProcessor, and inside its onTrigger() method I get all the flow files from queue1 for the current batch by doing the following:

 session.get(context.getProperty(batchSize).asInteger()) 

The first line of onTrigger() creates a timestamp and adds it to all the flow files, so all files in the batch should have the same timestamp. However, that is not happening: usually the first flow file gets one timestamp and the rest of the flow files get another.

It seems that when FetchFromServerProcessor fetches the first file from the server and puts it in queue1, MyCustomProcessor gets triggered and takes everything from the queue. At that moment there happens to be only a single file, so it is picked up as the only file in the batch. By the time MyCustomProcessor has processed that file, FetchFromServerProcessor has fetched the rest of the files from the server and put them in queue1. So after processing the first file, MyCustomProcessor takes all the remaining files in queue1 and forms a second batch, whereas I want all the files to be picked up in a single batch.

How can I avoid two batches being formed? I see that people discuss Wait/Notify in this context (1, 2), but I am not able to make quick sense of those posts. Can someone give me the minimal steps to achieve this using the Wait and Notify processors, or point me to a minimal tutorial with a step-by-step procedure for using them? Also, is the Wait/Notify pattern the standard approach to the batching problem I explained, or is there another standard approach to get this done?

It sounds as if this batch size is the required count of incoming flowfiles to CustomProcessor, so why not write your CustomProcessor#onTrigger() as follows:

@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    final ComponentLog logger = getLogger();
    // Try to get n flowfiles from incoming queue
    final Integer desiredFlowfileCount = context.getProperty(batchSize).asInteger();
    final int queuedFlowfileCount = session.getQueueSize().getObjectCount();
    if (queuedFlowfileCount < desiredFlowfileCount) {
        // There are not yet n flowfiles queued up, so don't try to run again immediately
        if (logger.isDebugEnabled()) {
            logger.debug("Only {} flowfiles queued; waiting for {}", new Object[]{queuedFlowfileCount, desiredFlowfileCount});
        }
        context.yield();
        return;
    }

    // If we're here, we do have at least n queued flowfiles
    List<FlowFile> flowfiles = session.get(desiredFlowfileCount);

    try {
        // TODO: Perform work on all flowfiles
        flowfiles = flowfiles.stream().map(f -> session.putAttribute(f, "timestamp", "my static timestamp value")).collect(Collectors.toList());
        session.transfer(flowfiles, REL_SUCCESS);

        // If extending AbstractProcessor, this is handled for you and you don't have to explicitly commit
        session.commit();
    } catch (Exception e) {
        logger.error("Helpful error message");
        if (logger.isDebugEnabled()) {
            logger.error("Further stacktrace: ", e);
        }
        // Penalize the flowfiles if appropriate (also done for you if extending
        // AbstractProcessor and an exception is thrown from this method)
        session.rollback(true);
        // --- OR ---
        // Transfer to failure if they can't be retried (do not combine with the
        // rollback above; a rolled-back session cannot also transfer):
        // session.transfer(flowfiles, REL_FAILURE);
    }
}

The Java 8 stream syntax can be replaced by this if it's unfamiliar:

        for (int i = 0; i < flowfiles.size(); i++) {
            // Write the same timestamp value onto all flowfiles
            FlowFile f = flowfiles.get(i);
            flowfiles.set(i, session.putAttribute(f, "timestamp", "my timestamp value"));
        }
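The queue-size guard can also be simulated without NiFi at all. This is a rough sketch, not NiFi API: a plain java.util.ArrayDeque stands in for the incoming connection, and the hypothetical drainBatch method plays the role of the guarded session.get() call — it returns either a full batch or nothing, mirroring the "yield and return" branch above:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class BatchGuardDemo {
    // Return a full batch of n items, or an empty list if fewer than n are
    // queued (the empty list corresponds to context.yield() + early return).
    static <T> List<T> drainBatch(Queue<T> queue, int n) {
        List<T> batch = new ArrayList<>();
        if (queue.size() < n) {
            return batch; // not enough queued yet: "yield"
        }
        for (int i = 0; i < n; i++) {
            batch.add(queue.poll());
        }
        return batch;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("file1"); // only one file has arrived so far

        // First trigger: batch size 3 not reached, nothing is taken
        System.out.println(drainBatch(queue, 3).size()); // prints 0

        queue.add("file2");
        queue.add("file3");

        // Second trigger: all three files come out in a single batch
        System.out.println(drainBatch(queue, 3)); // prints [file1, file2, file3]
    }
}
```

This is exactly why the guard fixes the two-batch symptom in the question: the early triggers that fire while only one file is queued take nothing, and the first trigger that actually drains the queue sees the complete batch.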

The distinction between penalization (telling the processor to delay performing work on a specific flowfile) and yielding (telling the processor to wait some period of time before attempting any work again) is important.

You probably also want the @TriggerSerially annotation on your custom processor to ensure you do not have multiple threads running such that a race condition could arise.
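What @TriggerSerially guarantees is that at most one thread executes onTrigger() at a time, so two concurrent threads can never split one batch between them. As a rough plain-Java analogy (no NiFi API here; the synchronized drain stands in for the serialized onTrigger, and all class and method names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

public class SerialDrainDemo {
    private final Queue<String> queue = new ArrayDeque<>();

    // synchronized plays the role of @TriggerSerially: only one thread
    // drains at a time, so each batch comes out whole, never interleaved.
    synchronized List<String> drainBatch(int n) {
        List<String> batch = new ArrayList<>();
        if (queue.size() < n) {
            return batch;
        }
        for (int i = 0; i < n; i++) {
            batch.add(queue.poll());
        }
        return batch;
    }

    synchronized void enqueue(String item) {
        queue.add(item);
    }

    public static void main(String[] args) throws InterruptedException {
        SerialDrainDemo demo = new SerialDrainDemo();
        for (int i = 1; i <= 6; i++) {
            demo.enqueue("file" + i);
        }

        List<List<String>> batches = Collections.synchronizedList(new ArrayList<>());
        Runnable worker = () -> batches.add(demo.drainBatch(3));
        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // Each thread got a complete batch of 3; neither batch was split.
        for (List<String> b : batches) {
            System.out.println(b.size()); // prints 3, twice
        }
    }
}
```

Without the serialization, two threads could each grab part of what should be one batch, which would reintroduce the split-batch symptom even with the queue-size guard in place.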
