
nested iterations with Apache Spark?

I'm considering Apache Spark (in Java) for a project, but this project requires the data processing framework to support nested iterations. I haven't been able to find any confirmation of that; does Spark support it? In addition, is there any example of nested iterations in use?

Thanks!

Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, an operation happens in parallel across a bunch of pieces of the data, rather than something happening to each piece sequentially (and then happening again).

However, a Spark (driver) program is just a program and can do whatever you want, locally. Of course, nested loops or whatever you like are entirely fine, just as in any Scala program.
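For instance, here is a minimal sketch in Java (the language the question asks about), assuming the standard Spark Java API with Java 8 lambdas; the dataset, the loop bounds, and the threshold-update rule are invented purely for illustration:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NestedLoopDriver {
    public static void main(String[] args) {
        // local[*] just makes the sketch runnable without a cluster
        SparkConf conf = new SparkConf().setAppName("nested-iterations").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0));

        // The loops below are ordinary Java control flow on the driver;
        // each filter/count inside launches a distributed job over the data.
        for (int outer = 0; outer < 3; outer++) {
            double threshold = 0.0;
            for (int inner = 0; inner < 5; inner++) {
                final double t = threshold;
                // Distributed step: count elements above the current threshold.
                long above = data.filter(x -> x > t).count();
                // Local step: update driver-side state from the small result.
                threshold += above / 10.0;  // made-up update rule
            }
            System.out.println("pass " + outer + ": threshold = " + threshold);
        }
        sc.stop();
    }
}
```

Nothing about the nesting is special to Spark here: the framework just sees a sequence of independent jobs, one per inner-loop pass.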

I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.

So the process is (see the sketch after the list):

  • Broadcast a bucketing scheme
  • Bucket according to that scheme in a distributed operation
  • Pull small summary stats back to the driver
  • Update the bucketing scheme and send it again
  • Repeat...
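A minimal sketch of that loop in Java with the RDD API: the initial boundaries, the `bucketOf` helper, the (count, sum) summary, and the boundary-update rule are all hypothetical stand-ins for whatever the real algorithm needs.

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class IterativeBucketing {

    // Hypothetical bucketing rule: index of the first boundary x falls below.
    static int bucketOf(double x, double[] bounds) {
        for (int i = 0; i < bounds.length; i++) {
            if (x < bounds[i]) return i;
        }
        return bounds.length;  // last bucket
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-bucketing").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Double> data = sc.parallelize(Arrays.asList(0.1, 0.4, 0.6, 0.9, 0.95));
        double[] boundaries = {0.25, 0.5, 0.75};  // initial scheme (invented)

        for (int pass = 0; pass < 10; pass++) {
            // 1. Broadcast the current bucketing scheme to the executors.
            Broadcast<double[]> scheme = sc.broadcast(boundaries);

            // 2. Bucket in a distributed operation, aggregating a per-bucket
            //    (count, sum) summary, then 3. pull the small map to the driver.
            Map<Integer, Tuple2<Long, Double>> stats = data
                .mapToPair(x -> new Tuple2<>(bucketOf(x, scheme.value()),
                                             new Tuple2<>(1L, x)))
                .reduceByKey((a, b) -> new Tuple2<>(a._1 + b._1, a._2 + b._2))
                .collectAsMap();
            scheme.unpersist();  // drop this pass's broadcast from the executors

            // 4. Update the scheme locally; here each boundary moves toward
            //    the mean of its bucket (a made-up rule), then the loop repeats.
            for (int i = 0; i < boundaries.length; i++) {
                Tuple2<Long, Double> s = stats.get(i);
                if (s != null && s._1 > 0) {
                    double mean = s._2 / s._1;
                    boundaries[i] = (boundaries[i] + mean) / 2.0;
                }
            }
        }
        sc.stop();
    }
}
```

Only the small per-bucket summary map crosses back to the driver on each pass; the data itself stays distributed, and the iteration structure is plain driver-side Java.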
