Nested iterations with Apache Spark?
I'm considering Apache Spark (in Java) for a project, but the project requires the data processing framework to support nested iterations. I haven't been able to find any confirmation of this; does Spark support them? In addition, are there any examples of the use of nested iterations?

Thanks!
Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, an operation happens in parallel across many pieces of the data, rather than something happening to each piece sequentially (and then happening again).
However, a Spark (driver) program is just a program and can do whatever you want, locally. Nested loops or whatever else you like are entirely fine, just as in any Scala program.
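As a minimal sketch of that pattern (plain Java here, with `parallelStream` standing in for a Spark RDD operation; in a real job each inner pass would be an RDD or Dataset action such as `rdd.filter(...).count()`, and the class and variable names are just illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class NestedIterationSketch {
    static double run() {
        // Hypothetical data; in Spark this would be an RDD or Dataset.
        List<Integer> data =
                IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());

        double threshold = 0.0;
        // Outer iteration: an ordinary driver-side loop.
        for (int outer = 0; outer < 3; outer++) {
            // Inner iteration: also driver-side; each pass triggers one
            // parallel operation over the whole dataset.
            for (int inner = 0; inner < 2; inner++) {
                final double t = threshold;
                long count = data.parallelStream().filter(x -> x > t).count();
                // Driver-side update between parallel passes.
                threshold += count / 100.0;
            }
        }
        return threshold;
    }

    public static void main(String[] args) {
        System.out.println("final threshold = " + run());
    }
}
```

The key point is that the iteration structure lives entirely in the driver program; Spark only sees a sequence of independent parallel jobs.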
I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.
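A hedged sketch of that split, again in plain Java (in Spark the bucketing/summary step would be something along the lines of `rdd.mapToPair(...).reduceByKey(...)` or `df.groupBy(...).agg(...)`; the bucket width and method names here are assumptions for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BucketStatsSketch {
    // Parallel "Spark-like" step: assign each value to a bucket and
    // compute a summary statistic (the mean) per bucket.
    static Map<Integer, Double> bucketMeans(List<Integer> data, int bucketWidth) {
        return data.parallelStream().collect(Collectors.groupingByConcurrent(
                x -> x / bucketWidth,
                Collectors.averagingInt(x -> x)));
    }

    public static void main(String[] args) {
        List<Integer> data =
                IntStream.range(0, 20).boxed().collect(Collectors.toList());
        Map<Integer, Double> means = bucketMeans(data, 10);
        // The simple remainder of the logic runs locally on the driver.
        means.forEach((bucket, mean) ->
                System.out.println("bucket " + bucket + " mean " + mean));
    }
}
```

Only the per-bucket aggregation needs the cluster; once the summary stats are small enough to fit on the driver, ordinary local code (including nested loops) takes over.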
So the process is: