Apache Spark: cheapest way to trigger an RDD transformation
I'm totally new to Apache Spark, and I've set up a standalone cluster to run a sorting algorithm on large amounts of data (integers). I have it working the way I want. The core is as follows:
JavaRDD<Integer> rdd = ctx
        .parallelize(Collections.<Integer>emptyList(), PARTITIONS)
        .mapPartitions(partition ->
                ThreadLocalRandom
                        .current()
                        .ints(NUMBERS_PER_PARTITION, Integer.MIN_VALUE, Integer.MAX_VALUE)
                        .boxed()
                        .parallel()
                        .collect(Collectors.toList())
                        .iterator()) // Spark 2.x+: mapPartitions expects an Iterator, not a List
        .sortBy(x -> x, true, PARTITIONS);
This will generate random numbers in the cluster and then sort them.
The problem is that I am only interested in the sorting time for an experiment, but Spark is lazy, so the sort is only triggered by an action. I'm using count() to trigger it, but the count itself takes a very long time to finish, which delays my experiment. I don't care about retrieving the sorted numbers, or even a sample of them, since I already know the sort is correct.

Is there a way to trigger the .sortBy() without having to wait for the triggering action to finish? And if there isn't, is there a cheaper action than count()?
sortBy is a lazy Spark transformation, so you need an action to force its evaluation. You already tried count(), which takes a lot of time. Try first() or take(n) instead.

Here is a list of transformations (lazy) and actions:
https://www.mapr.com/ebooks/spark/apache-spark-cheat-sheet.html
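Another common low-overhead option is a no-op foreach: it forces every partition to be computed but ships no data back to the driver, unlike count()'s per-partition tallies or collect(). A minimal sketch, assuming the rdd from the question and a Spark dependency on the classpath (the class and method names here are illustrative, not part of any API):

```java
import org.apache.spark.api.java.JavaRDD;

public class SortTimer {

    // Hypothetical helper: forces full evaluation of the RDD without
    // moving any data to the driver. The empty lambda body means the
    // only real work done is the (lazy) sort itself.
    static void materialize(JavaRDD<Integer> rdd) {
        rdd.foreach(x -> { });
    }

    // Times only the action that triggers the pending sortBy.
    static long timeSortNanos(JavaRDD<Integer> sorted) {
        long start = System.nanoTime();
        materialize(sorted);
        return System.nanoTime() - start;
    }
}
```

Note that this timing still includes the random-number generation in mapPartitions, since that stage is also lazy; if you want to time the sort alone, cache() the unsorted RDD and materialize it once before calling sortBy.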