
nested iterations with Apache Spark?

I'm considering Apache Spark (in Java) for a project, but this project requires the data processing framework to support nested iterations. I haven't been able to find any confirmation of that; does Spark support it? In addition, is there any example of nested iterations in use?

Thanks!

Just about anything can be done, but the question is what fits the execution model well enough to bother. Spark's operations are inherently parallel, not iterative. That is, an operation happens in parallel across a bunch of pieces of the data, rather than something happening to each piece sequentially (and then happening again).

However, a Spark (driver) program is just a program and can do whatever you want, locally. Of course, nested loops or whatever you like are entirely fine, just as in any Scala program.
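For instance, here is a minimal sketch in Java (the language the question asks about), assuming the standard Spark Java API with Java 8 lambdas; the dataset, the loop bounds, and the threshold-update rule are invented purely for illustration:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NestedLoopDriver {
    public static void main(String[] args) {
        // local[*] just makes the sketch runnable without a cluster
        SparkConf conf = new SparkConf().setAppName("nested-iterations").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0));

        // The loops below are ordinary Java control flow on the driver;
        // each filter/count inside launches a distributed job over the data.
        for (int outer = 0; outer < 3; outer++) {
            double threshold = 0.0;
            for (int inner = 0; inner < 5; inner++) {
                final double t = threshold;
                // Distributed step: count elements above the current threshold.
                long above = data.filter(x -> x > t).count();
                // Local step: update driver-side state from the small result.
                threshold += above / 10.0;  // made-up update rule
            }
            System.out.println("pass " + outer + ": threshold = " + threshold);
        }
        sc.stop();
    }
}
```

Nothing about the nesting is special to Spark here: the framework just sees a sequence of independent jobs, one per inner-loop pass.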

I think you might use Spark operations for the bucketing process and to compute summary stats for each bucket, but otherwise run the simple remainder of the logic locally on the driver.

So the process is (see the sketch after the list):

  • Broadcast a bucketing scheme
  • Bucket according to that scheme in a distributed operation
  • Pull small summary stats back to the driver
  • Update the bucketing scheme and send it again
  • Repeat...
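A minimal sketch of that loop in Java with the RDD API: the initial boundaries, the `bucketOf` helper, the (count, sum) summary, and the boundary-update rule are all hypothetical stand-ins for whatever the real algorithm needs.

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class IterativeBucketing {

    // Hypothetical bucketing rule: index of the first boundary x falls below.
    static int bucketOf(double x, double[] bounds) {
        for (int i = 0; i < bounds.length; i++) {
            if (x < bounds[i]) return i;
        }
        return bounds.length;  // last bucket
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-bucketing").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Double> data = sc.parallelize(Arrays.asList(0.1, 0.4, 0.6, 0.9, 0.95));
        double[] boundaries = {0.25, 0.5, 0.75};  // initial scheme (invented)

        for (int pass = 0; pass < 10; pass++) {
            // 1. Broadcast the current bucketing scheme to the executors.
            Broadcast<double[]> scheme = sc.broadcast(boundaries);

            // 2. Bucket in a distributed operation, aggregating a per-bucket
            //    (count, sum) summary, then 3. pull the small map to the driver.
            Map<Integer, Tuple2<Long, Double>> stats = data
                .mapToPair(x -> new Tuple2<>(bucketOf(x, scheme.value()),
                                             new Tuple2<>(1L, x)))
                .reduceByKey((a, b) -> new Tuple2<>(a._1 + b._1, a._2 + b._2))
                .collectAsMap();
            scheme.unpersist();  // drop this pass's broadcast from the executors

            // 4. Update the scheme locally; here each boundary moves toward
            //    the mean of its bucket (a made-up rule), then the loop repeats.
            for (int i = 0; i < boundaries.length; i++) {
                Tuple2<Long, Double> s = stats.get(i);
                if (s != null && s._1 > 0) {
                    double mean = s._2 / s._1;
                    boundaries[i] = (boundaries[i] + mean) / 2.0;
                }
            }
        }
        sc.stop();
    }
}
```

Only the small per-bucket summary map crosses back to the driver on each pass; the data itself stays distributed, and the iteration structure is plain driver-side Java.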
