
Spark job with no input dataset

I want to write a Spark job that produces millions of random numbers as output. This does not need an input dataset, but it would be good to have the parallelism of a cluster.

I understand that Spark runs on RDDs, which are datasets by definition. I am just wondering whether there is a way to force many executors to run a specific function without an input RDD, or by creating a mock RDD.

import scala.util.Random

// Each seed element is how many random numbers its partition should generate;
// (1 to count) yields exactly count values, where 0.to(count) would give count + 1.
sc.parallelize(Seq(1000, 1000, 1000))
  .repartition(3)
  .flatMap(count => (1 to count).map(_ => Random.nextInt()))
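
On Spark 2.x or later, the same idea can be expressed without a seed collection at all. The following is a minimal sketch, assuming a SparkSession named spark; totalCount, numPartitions, and the output path are illustrative values, not anything from the question.

import scala.util.Random
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("random-numbers").getOrCreate()
import spark.implicits._

val totalCount    = 3000000L // target number of random values (assumed)
val numPartitions = 3        // desired parallelism (assumed)

// spark.range builds a distributed Dataset[Long] with no input files,
// already split into numPartitions partitions; map each row to a random Int.
val randoms = spark
  .range(0L, totalCount, 1L, numPartitions)
  .map(_ => Random.nextInt())

randoms.write.csv("/tmp/random-numbers") // output location is an assumption

Because the range is created with the requested number of partitions up front, this variant also avoids the extra shuffle that repartition triggers.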
