
MapReduce/Aggregate operations in SpringBatch

Is it possible to do MapReduce-style operations in Spring Batch?

I have two steps in my batch job. The first step calculates an average. The second step compares each value with the average to determine another value.

For example, let's say I have a huge database of student scores. The first step calculates the average score in each course/exam. The second step compares individual scores with the average to determine a grade based on some simple rules:

  1. A if the student scores above average
  2. B if the student scores exactly the average
  3. C if the student scores below average
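The grading rule above can be sketched in plain Java (a minimal, illustrative example; the class and method names are my own, not part of any framework):

```java
import java.util.List;

public class GradeExample {
    // Compare a score with the course average, per the rule above:
    // above average -> A, exactly average -> B, below average -> C.
    static String grade(double score, double avg) {
        if (score > avg) return "A";
        if (score == avg) return "B";
        return "C";
    }

    public static void main(String[] args) {
        List<Double> mathScores = List.of(90.0, 80.0, 70.0);
        double avg = mathScores.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0); // 80.0 for this sample
        for (double s : mathScores) {
            System.out.println(s + " -> " + grade(s, avg));
        }
    }
}
```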

Currently my first step is a SQL query that selects the average and writes it to a table. The second step is a SQL query that joins the average scores with the individual scores and uses a Processor to implement the rule.

Similar aggregation functions, like avg and min, are used a lot in my steps, and I'd really prefer that this be done in Processors, keeping the SQL as simple as possible. Is there any way to write a Processor that aggregates results across multiple rows based on a grouping criterion and then writes the average/min to the output table once?

This pattern repeats a lot, and I'm not looking for a single-Processor implementation built on a SQL query that fetches both the average and the individual scores.

It is possible. You do not even need more than one step: Map-Reduce can be implemented in a single step. You can create a step with an ItemReader and an ItemWriter associated with it, and think of the ItemReader-ItemWriter pair as the Map-Reduce. You can achieve the necessary effect by using a custom reader and writer with proper line aggregation. It might also be a good idea for your reader/writer to implement the ItemStream interface, so that Spring Batch saves intermediate state to the step's execution context.
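The single-step idea can be illustrated without the framework: the "reader" emits (course, score) records one at a time, and an aggregating "writer" folds each record into per-course running sums, emitting one average per course at the end. This is a minimal sketch of the pattern only; the names below are illustrative and are not Spring Batch API (in a real job the accumulator state would live in an ItemWriter, persisted via the execution context):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AggregatingWriterSketch {
    record Score(String course, double value) {}

    // "Reduce" phase: fold the stream of records into per-course
    // sum/count accumulators, then emit one average per course.
    static Map<String, Double> averages(List<Score> records) {
        Map<String, double[]> acc = new HashMap<>(); // course -> [sum, count]
        for (Score s : records) {
            double[] a = acc.computeIfAbsent(s.course(), k -> new double[2]);
            a[0] += s.value();
            a[1] += 1;
        }
        Map<String, Double> out = new HashMap<>();
        acc.forEach((course, a) -> out.put(course, a[0] / a[1]));
        return out;
    }

    public static void main(String[] args) {
        List<Score> records = List.of(
                new Score("math", 90), new Score("math", 70),
                new Score("cs", 60), new Score("cs", 80), new Score("cs", 100));
        System.out.println(averages(records)); // math -> 80.0, cs -> 80.0
    }
}
```

In a real Spring Batch step, the writer would receive these records in chunks, so the accumulator map must survive across chunk boundaries (and restarts), which is what the execution-context save gives you.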

I tried it just for fun, but I think it is pointless, since your working capacity is limited by a single JVM; in other words, you cannot reach the production performance of a Hadoop cluster (or another real map-reduce implementation). It will also be really hard to scale as your data size grows.

Nice observation, but IMO currently useless for real-world tasks.

I feel that a batch processing framework should separate programming/configuration from run-time concerns. It would be nice if Spring Batch provided a generic solution over all major batch processing run times, like a single JVM, a Hadoop cluster (which also uses the JVM), etc.

-> Write batch programs using the Spring Batch programming/configuration model, which integrates other programming models like map-reduce, traditional Java, etc.

-> Select the run-time based on your needs (single JVM, Hadoop cluster, or NoSQL).

Spring Data attempts to solve a part of this, providing a unified configuration model and API usage for various types of data sources.

