How to parallelize an algorithm in Java using Spark?
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.commons.lang.time.StopWatch;
import java.util.ArrayList;
import java.util.List;

public class Prime {
    // Method to calculate and count the prime numbers
    public List<Integer> countPrime(int n) {
        List<Integer> primes = new ArrayList<>();
        for (int i = 2; i < n; i++) {
            boolean isPrime = true;
            // check if the number is prime or not
            for (int j = 2; j < i; j++) {
                if (i % j == 0) {
                    isPrime = false;
                    break; // exit the inner for loop
                }
            }
            // add the primes into the List
            if (isPrime) {
                primes.add(i);
            }
        }
        return primes;
    }

    // Main method to run the program
    public static void main(String[] args) {
        StopWatch watch = new StopWatch();
        watch.start();
        // creating JavaSparkContext object
        SparkConf conf = new SparkConf().setAppName("haha").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // new Prime object
        Prime prime = new Prime();
        //prime.countPrime(1000000);
        // parallelize the collection
        JavaRDD<Integer> rdd = sc.parallelize(prime.countPrime(1000000), 12);
        long count = rdd.filter(e -> e == 2 || e % 2 != 0).count();
        // Stopping the timer and printing the results
        watch.stop();
        System.out.println("Total time taken to run the process is " + watch);
        System.out.println("The number of primes between 0 and 1000000 is " + count);
        sc.stop();
    }
}
Hi there, I have the following code, which parallelizes an algorithm. The algorithm counts the number of primes in a given range. But the code only parallelizes the list of primes, not the process itself. How can I modify the code to parallelize the process of finding the primes?
It's an order of operations issue - you're running prime.countPrime on the driver before you've created your Spark RDD, so the expensive work is already done by the time Spark gets involved. Spark runs in parallel only the operations that are defined within the RDD object's map, reduce, filter, etc. operations. You need to rethink your approach:
1. Use sc.range(1, 1000000, 1, 12) to create an RDD of all integers from 1 to 1,000,000.
2. Create an isPrime(int n) method that evaluates whether a given integer is prime.
3. filter your RDD on the condition of your isPrime method (this is the part that will execute in parallel).
4. count the filtered RDD.
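The steps above can be sketched as follows. This is a minimal sketch, not a drop-in replacement: it builds the candidate range on the driver with sc.parallelize over a generated list (an alternative to sc.range, which belongs to the Scala SparkContext API), and the class and app names (ParallelPrime, "primes") are placeholders. The key difference from the original code is that the primality test now runs inside filter, so it executes in parallel on the workers:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelPrime {

    // Trial division up to sqrt(n); this runs on the executors,
    // once per candidate, inside the filter below.
    public static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (long j = 3; j * j <= n; j += 2) {
            if (n % j == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("primes").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Candidate range, split into 12 partitions across the workers.
        List<Long> candidates = LongStream.range(2, 1_000_000)
                .boxed()
                .collect(Collectors.toList());
        JavaRDD<Long> rdd = sc.parallelize(candidates, 12);

        // The primality test itself is now the parallel step.
        long count = rdd.filter(ParallelPrime::isPrime).count();

        System.out.println("Number of primes below 1,000,000: " + count);
        sc.stop();
    }
}
```

Note that only cheap work (generating the candidate list) happens on the driver; the O(sqrt(n)) test per number is deferred until filter, which Spark distributes across the partitions.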