
MapReduce/Aggregate operations in SpringBatch

Is it possible to do MapReduce style operations in SpringBatch?

I have two steps in my batch job. The first step calculates an average. The second step compares each value with the average to determine another value.

For example, let's say I have a huge database of student scores. The first step calculates the average score in each course/exam. The second step compares individual scores with the average to determine a grade based on a simple rule:

  1. A if the student scores above average
  2. B if the student's score equals the average
  3. C if the student scores below average
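The grading rule above can be sketched as a small piece of plain Java. In a Spring Batch job this logic would live inside an `ItemProcessor`; the class and method names here are hypothetical illustrations, not part of any framework API.

```java
// Sketch of the grading rule: compare a score against the course average.
// In Spring Batch this would typically be the body of an ItemProcessor.
public class GradeRule {
    public static String grade(double score, double average) {
        if (score > average) return "A";   // above average
        if (score == average) return "B";  // exactly the average
        return "C";                        // below average
    }
}
```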

Currently my first step is a SQL statement that selects the average and writes it to a table. The second step is a SQL statement that joins the average scores with the individual scores and uses a Processor to implement the rule.

Similar aggregation functions like avg and min are used a lot in my steps, and I'd really prefer to do this in Processors, keeping the SQL as simple as possible. Is there any way to write a Processor that aggregates results across multiple rows based on a grouping criterion and then writes the average/min to the output table once?

This pattern repeats a lot, and I'm not looking for a single-processor implementation that uses a SQL query to fetch both the average and the individual scores.

It is possible. You do not even need more than one step; map-reduce can be implemented in a single step. You can create a step with an ItemReader and an ItemWriter associated with it. Think of the ItemReader/ItemWriter pair as Map/Reduce. You can achieve the necessary effect by using a custom reader and writer with proper line aggregation. It might be a good idea for your reader/writer to implement the ItemStream interface so that Spring Batch saves intermediate state to the step's ExecutionContext.
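The aggregating-writer idea might look roughly like the sketch below: accumulate per-course sums and counts as chunks arrive, then emit one average per course at the end. This is plain Java under assumed names (a hypothetical `Score` record, a `write` method mirroring `ItemWriter.write`, an `averages` method standing in for the flush that would happen in `ItemStream.close()`); a real implementation would implement Spring Batch's `ItemWriter` and `ItemStream` interfaces and persist the partial sums in the `ExecutionContext` for restartability.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a reduce-style writer that aggregates across rows by a
// grouping key (the course). Names and shape are illustrative only.
public class AveragingWriter {
    public record Score(String course, double value) {}

    // course -> {sum, count}; in Spring Batch this partial state would be
    // saved to the ExecutionContext via ItemStream.update() for restarts.
    private final Map<String, double[]> sums = new HashMap<>();

    // Called once per chunk, like ItemWriter.write(...).
    public void write(List<Score> chunk) {
        for (Score s : chunk) {
            double[] acc = sums.computeIfAbsent(s.course(), k -> new double[2]);
            acc[0] += s.value();
            acc[1] += 1;
        }
    }

    // Called when the step ends (e.g. from ItemStream.close()):
    // emit one average per course.
    public Map<String, Double> averages() {
        Map<String, Double> out = new HashMap<>();
        sums.forEach((course, acc) -> out.put(course, acc[0] / acc[1]));
        return out;
    }
}
```

The design point is that the "reduce" state lives in the writer across chunks, which is exactly why implementing ItemStream matters: without persisting the partial sums, a restart would lose them.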

I tried it just for fun, but I think it is pointless, since your working capacity is limited to a single JVM; in other words, you cannot reach the production performance of a Hadoop cluster (or another real map-reduce implementation). It will also be really hard to scale as your data size grows.

Nice observation, but IMO currently useless for real-world tasks.

I feel that a batch processing framework should separate programming/configuration concerns from run-time concerns. It would be nice if Spring Batch provided a generic solution across all major batch processing runtimes, such as a single JVM, a Hadoop cluster (which also uses the JVM), etc.

-> Write batch programs using the Spring Batch programming/configuration model, which integrates other programming models like map-reduce, traditional Java, etc.

-> Select the runtime based on your needs (single JVM, Hadoop cluster, or NoSQL).

Spring Data attempts to solve part of this by providing a unified configuration model and API usage for various types of data sources.

