Apache Spark: cheapest way to trigger an RDD transformation
I'm totally new to Apache Spark, and I've set up a standalone cluster to run a sorting algorithm on large amounts of data (integers). I have it working the way I want. The core is as follows:
JavaRDD<Integer> rdd = ctx
        .parallelize(Collections.<Integer>emptyList(), PARTITIONS)
        .mapPartitions(partition ->
                ThreadLocalRandom
                        .current()
                        .ints(NUMBERS_PER_PARTITION, Integer.MIN_VALUE, Integer.MAX_VALUE)
                        .boxed()
                        .parallel()
                        .collect(Collectors.toList())
                        .iterator()) // Spark 2.x+: mapPartitions expects an Iterator, not a List
        .sortBy(x -> x, true, PARTITIONS);
This will generate random numbers in the cluster and then sort them.
The problem is that I am only interested in the sorting time for an experiment, but Spark is lazy, so the sort is only triggered by an action. I'm using count() to trigger it, but the count itself takes a very long time to finish, which delays my experiment. I don't care about retrieving the sorted numbers, or even a sample of them, since I already know the sort is correct.

Is there a way to trigger the .sortBy() without having to wait for the triggering action to finish? And if there isn't, is there a cheaper action than count()?
sortBy is a lazy Spark transformation, so you need an action to force its evaluation. You already tried count(), which takes a lot of time. Try first() or take(n) instead.

Here is a list of transformations (lazy) and actions:
https://www.mapr.com/ebooks/spark/apache-spark-cheat-sheet.html
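Another common low-overhead option is a no-op foreach: it forces every partition to be computed but ships no data back to the driver, unlike count()'s per-partition tallies or collect(). A minimal sketch, assuming the rdd from the question and a Spark dependency on the classpath (the class and method names here are illustrative, not part of any API):

```java
import org.apache.spark.api.java.JavaRDD;

public class SortTimer {

    // Hypothetical helper: forces full evaluation of the RDD without
    // moving any data to the driver. The empty lambda body means the
    // only real work done is the (lazy) sort itself.
    static void materialize(JavaRDD<Integer> rdd) {
        rdd.foreach(x -> { });
    }

    // Times only the action that triggers the pending sortBy.
    static long timeSortNanos(JavaRDD<Integer> sorted) {
        long start = System.nanoTime();
        materialize(sorted);
        return System.nanoTime() - start;
    }
}
```

Note that this timing still includes the random-number generation in mapPartitions, since that stage is also lazy; if you want to time the sort alone, cache() the unsorted RDD and materialize it once before calling sortBy.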