
Apache Spark: cheapest way to trigger an RDD transformation

I'm totally new to Apache Spark and I've set up a standalone cluster to run a sorting algorithm on large amounts of data (integers).

I have it working the way I want. The core is as follows:

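// Presumably ctx is a JavaSparkContext and PARTITIONS / NUMBERS_PER_PARTITION are int constants.
// Imports needed: java.util.Collections, java.util.concurrent.ThreadLocalRandom,
// java.util.stream.Collectors, org.apache.spark.api.java.JavaRDD
JavaRDD<Integer> rdd = ctx
                // Start from an empty RDD with PARTITIONS partitions; the data is generated on the executors.
                .parallelize(Collections.<Integer>emptyList(), PARTITIONS)
                // Fill each partition with NUMBERS_PER_PARTITION random integers.
                // Note: on Spark 2.x and later the function must return an Iterator,
                // so append .iterator() after collect(...).
                .mapPartitions(partition ->
                        ThreadLocalRandom
                                .current()
                                .ints(NUMBERS_PER_PARTITION, Integer.MIN_VALUE, Integer.MAX_VALUE)
                                .boxed()
                                .parallel()
                                .collect(Collectors.toList()))
                // Lazy transformation: the sort itself runs only when an action is called.
                .sortBy(x -> x, true, PARTITIONS);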

This generates random numbers in the cluster and then sorts them.

The problem is that, for my experiment, I am only interested in the sorting time, but Spark is lazy and the sort is only triggered by an action. I'm using count() to trigger the sort, but the count takes a very long time to finish, which delays my experiment. I don't care about getting the sorted numbers, or even a sample of them, since I already know the sort is correct.

Is there a way to trigger the .sortBy() without having to wait for the action that triggered it to finish? And if there isn't, is there a cheaper action than count()?

sortBy is a lazy Spark transformation.
You can call any action (a non-lazy operation that returns a value) to trigger it.
You already tried count(), which takes a long time.
Try first() or take(n) instead.
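To compare how long each candidate action takes, a small timing wrapper can help. The following is only a minimal sketch in plain Java: the class `ActionTimer` and its `time` method are hypothetical names introduced here for illustration, not part of Spark, and the measurement is simple wall-clock time around whatever call is passed in.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Spark): measures the wall-clock time of one call. */
final class ActionTimer {

    /** Holds the value returned by the timed call and the elapsed time in nanoseconds. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        /** Elapsed time converted to milliseconds. */
        double elapsedMillis() {
            return elapsedNanos / 1_000_000.0;
        }
    }

    private ActionTimer() {
        // Static helper only; no instances.
    }

    /** Runs the given action once and records how long it took. */
    static <T> Timed<T> time(Supplier<T> action) {
        long start = System.nanoTime();
        T value = action.get();
        long elapsed = System.nanoTime() - start;
        return new Timed<>(value, elapsed);
    }
}
```

With Spark on the classpath and the `rdd` from the question, it could be used as `ActionTimer.time(() -> rdd.first())` or `ActionTimer.time(() -> rdd.take(1))`. The measured time covers everything the action triggers, including generating the random numbers, not just the sort.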

Here is a list of lazy transformations and non-lazy actions:

https://www.mapr.com/ebooks/spark/apache-spark-cheat-sheet.html

