
Spark job with no input dataset

I want to write a Spark job that produces millions of random numbers as output. This does not need an input dataset, but it would be good to have the parallelism of a cluster.

I understand that Spark runs on RDDs, which are datasets by definition. I am just wondering whether there is a way to force many executors to run a specific function without an input RDD, or by creating a mock RDD.

import scala.util.Random

// Each seed element is how many random numbers its partition should generate;
// (1 to count) yields exactly count values, where 0.to(count) would give count + 1.
sc.parallelize(Seq(1000, 1000, 1000))
  .repartition(3)
  .flatMap(count => (1 to count).map(_ => Random.nextInt()))
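
On Spark 2.x or later, the same idea can be expressed without a seed collection at all. The following is a minimal sketch, assuming a SparkSession named spark; totalCount, numPartitions, and the output path are illustrative values, not anything from the question.

import scala.util.Random
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("random-numbers").getOrCreate()
import spark.implicits._

val totalCount    = 3000000L // target number of random values (assumed)
val numPartitions = 3        // desired parallelism (assumed)

// spark.range builds a distributed Dataset[Long] with no input files,
// already split into numPartitions partitions; map each row to a random Int.
val randoms = spark
  .range(0L, totalCount, 1L, numPartitions)
  .map(_ => Random.nextInt())

randoms.write.csv("/tmp/random-numbers") // output location is an assumption

Because the range is created with the requested number of partitions up front, this variant also avoids the extra shuffle that repartition triggers.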
