简体   繁体   中英

Processing data based on the metadata in the file using apache camel

I have to setup camel to process data files where the first line of the file is the metadata and then it follows with millions of lines of actual data. The metadata dictates how the data is to be processed. What I am looking for is something like this:

Read first line (metadata) and populate a bean (with metadata) --> 2. then send data 1000 lines at a time to the data processor which will refer to the bean in step # 1

Is it possible in Apache Camel?

Yes.

An example architecture might look something like this:

  1. You could setup a simple queue that could be populated with file names (or whatever identifier you are using to locate each individual file).
  2. From the queue, you could route through a message translator bean, whose sole is to translate a request for a filename into a POJO that contains the metadata from the first line of the file.
  3. (You have a few options here)
    • Your approach to processing the 1000 line sets will depend on whether or not the output or resulting data created from the 1000 lines sets needs to be recomposed into a single message and processed again later. If so, you could implement a composed message processor made up of a message producer/consumer, a message aggregator and a router. The message producer/consumer would receive the POJO with the metadata created in step2 and enqueue as many new requests are necessary to process all of the lines in the file. The router would route from this queue through your processing pipeline and into the message aggregator. Once aggregated, a single unified message with all of your important data will be available for you to do what you will.
    • If instead each 1000 line set can be processed independently and rejoining is not required, than it is not necessary to agggregate the messages. Instead, you can use a router to route from step 2 to a producer/consumer that will, like above, enquene the necessary number of new requests for each file. Finally, the router will route from this final queue to a consumer that will do the processing.

Since you have a large quantity of data to deal with, it will likely be difficult to pass around 1000 line groups of data through messages, especially if they are being placed in a queue (you don't want to run out of memory). I recommend passing around some type of indicator that can be used to identify which line of the file a specific request was for, and then parse the 1000 lines when you need them. You could do this in a number of ways, like by calculating the number of bytes deep into a file a specific line is, and then using a file reader's skip() method to jump to that line when the request hits the bean that will be processing it.

Here are some resources provided on the Apache Camel website that describe the enterprise integration patterns that I mentioned above:

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM