
How to parallelize an algorithm in Java using Spark?

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.commons.lang.time.StopWatch;
import java.util.ArrayList;
import java.util.List;

public class Prime {

    //Collects all the prime numbers below n
    public List<Integer> countPrime(int n){
        List<Integer> primes = new ArrayList<>();
        for (int i = 2; i < n; i++){
            boolean isPrime = true;

            //check if the number is prime or not
            for (int j = 2; j < i; j++){
                if (i % j == 0){
                    isPrime = false;
                    break;  // exit the inner for loop
                }
            }

            //add the primes into the List
            if (isPrime){
                primes.add(i);
            }
        }
        return primes;
    }

    //Main method to run the program
    public static void main(String[] args){
        StopWatch watch = new StopWatch();
        watch.start();

        //creating javaSparkContext object
        SparkConf conf = new SparkConf().setAppName("haha").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        //new prime object
        Prime prime = new Prime();
        //prime.countPrime(1000000);

        //parallelize the collection
        JavaRDD<Integer> rdd = sc.parallelize(prime.countPrime(1000000),12);
        long count = rdd.filter(e -> e == 2 || e % 2 != 0).count();


        //Stopping the execution time and printing the results
        watch.stop();
        System.out.println("Total time taken to run the process is " + watch);
        System.out.println("The number of primes between 0 and 1000000 is " + count);
        sc.stop();
    }
}

Hi there, I have the following code, which parallelizes an algorithm. The algorithm counts the number of primes in a given range. But the code only parallelizes the list of primes, not the process of finding them. How can I modify the code to parallelize the process of finding the primes?

It's an order-of-operations issue - you're running prime.countPrime() before you've created your Spark RDD. Spark runs in parallel only those operations that are defined within the RDD object's map, reduce, filter, etc. operations. You need to rethink your approach:

  1. Use sc.range(1, 1000000, 1, 12) to create an RDD of all integers from 1 to 1,000,000.

  2. Create an isPrime(int n) method to evaluate whether a given integer is prime.

  3. filter your RDD on the condition of your isPrime method (this is the part that will execute in parallel).

  4. count the filtered RDD.
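Putting the steps above together, a minimal sketch might look like the code below. The class name ParallelPrime and the isPrime implementation are my own; and since I'm not certain JavaSparkContext exposes a range method directly, this sketch builds the candidate list with IntStream and hands it to parallelize, which has the same effect of distributing the numbers across 12 partitions:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelPrime {

    // Per-element primality test. Spark applies it to each number
    // independently, so the filter below runs in parallel on the workers.
    public static boolean isPrime(int n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (int j = 3; (long) j * j <= n; j += 2) {
            if (n % j == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParallelPrime").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Step 1: candidates 2..999999, distributed over 12 partitions.
        List<Integer> candidates =
                IntStream.range(2, 1_000_000).boxed().collect(Collectors.toList());

        // Steps 3 and 4: the primality test itself is what runs in parallel.
        long count = sc.parallelize(candidates, 12)
                       .filter(ParallelPrime::isPrime)
                       .count();

        System.out.println("Number of primes below 1,000,000: " + count);
        sc.stop();
    }
}
```

The key difference from the original code is that the expensive work (trial division inside isPrime) now happens inside the filter transformation, after the RDD has been created, instead of on the driver before Spark is ever involved.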
