Multithreaded Segmented Sieve of Eratosthenes in Java
I am trying to create a fast prime generator in Java. It is (more or less) accepted that the fastest way to do this is the segmented sieve of Eratosthenes: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes. Many further optimizations can be implemented to make it faster still. As of now, my implementation generates the 50847534 primes below 10^9 in about 1.6 seconds, but I am looking to make it faster and at least break the 1 second barrier. To increase the chance of getting good replies, I will include a walkthrough of the algorithm as well as the code.
Still, as a TL;DR, I am looking to include multi-threading into the code.
For the purposes of this question, I want to distinguish between the 'segmented' and the 'traditional' sieves of Eratosthenes. The traditional sieve requires O(n) space and is therefore very limited in the range of its input (the limit). The segmented sieve, however, only requires O(n^0.5) space and can operate on much larger limits. (A main speed-up comes from cache-friendly segmentation that takes into account the L1 & L2 cache sizes of the specific computer.) Finally, the main difference that concerns my question is that the traditional sieve is sequential, meaning it can only continue once the previous steps are completed. The segmented sieve, however, is not. Each segment is independent and is 'processed' individually against the sieving primes (the primes not larger than n^0.5). This means that, theoretically, once I have the sieving primes, I can divide the work between multiple computers, each processing a different segment. The work of each one is independent of the others. Assuming (wrongly) that each segment requires the same amount of time t to complete, and that there are k segments, one computer would require a total time of T = k * t, whereas k computers, each working on a different segment, would require a total time of T = t to complete the entire process. (In practice this is wrong, but it keeps the example simple.)
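The idealized argument above can be written out for a general number of workers. This is only a sketch under stated assumptions: p workers (the symbol p is introduced here and does not appear in the question), the same time t for every segment, and no overhead for handing out segments or combining the counts.

```latex
T_1 = k\,t, \qquad
T_p = \left\lceil \frac{k}{p} \right\rceil t, \qquad
S_p = \frac{T_1}{T_p} = \frac{k}{\lceil k/p \rceil} \le p .
```

With p = k this reduces to T = t as stated above. For the numbers used in this question, a limit of 10^9 and a segment size of 2^18 give k = 3815 segments, so p = 4 yields T_4 = 954 t and an ideal speedup of 3815 / 954, just under 4. Real runs should be expected to fall short of that, since segments do not all take the same time.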
This brought me to reading about multithreading: dividing the work among a few threads, each processing a smaller amount of work, for better usage of the CPU. To my understanding, the traditional sieve cannot be multithreaded precisely because it is sequential. Each thread would depend on the previous one, rendering the entire idea unfeasible. But a segmented sieve may indeed (I think) be multithreaded.
Instead of jumping straight into my question, I think it is important to introduce my code first, so I am including my current fastest implementation of the segmented sieve. I have worked quite hard on it. It took quite some time, slowly tweaking it and adding optimizations. The code is not simple. It is rather complex, I would say. I therefore assume the reader is familiar with the concepts I am introducing, such as wheel factorization, prime numbers, segmentation and more. I have included notes to make it easier to follow.
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;

public class primeGen {
    public static long x = (long)Math.pow(10, 9); //limit
    public static int sqrtx;
    public static boolean [] sievingPrimes; //the sieving primes, <= sqrtx
    public static int [] wheels = new int [] {2,3,5,7,11,13,17,19}; // base wheel primes
    public static int [] gaps; //the gaps, according to the wheel. will enable skipping multiples of the wheel primes
    public static int nextp; // the first prime > wheel primes
    public static int l; // the amount of gaps in the wheel
    public static void main(String[] args)
    {
        long startTime = System.currentTimeMillis();
        preCalc(); // creating the sieving primes and calculating the list of gaps
        int segSize = Math.max(sqrtx, 32768*8); //size of each segment
        long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
        int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes
        long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx
        for (long low = 0 ; low < x ; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
        {
            long high = Math.min(x, low + segSize);
            boolean [] segment = new boolean [(int) (high - low + 1)];
            int g = -1;
            for (int i = nextp ; i <= sqrtx ; i += gaps[g])
            {
                if (sievingPrimes[(i + 1) / 2])
                {
                    long firstMultiple = (long) (low / i * i);
                    if (firstMultiple < low)
                        firstMultiple += i;
                    if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                        firstMultiple += i;
                    for (long j = firstMultiple ; j <= high ; j += i * 2) // <= high: the slot for 'high' exists and is read below when u == high (matters for an odd limit)
                        segment[(int) (j - low)] = true;
                }
                g++;
                //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment** , and therefore this check is redundant.
                //should be used with bigger segment sizes or smaller lists of gaps
                //g = 0;
            }
            while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }
        }
        System.out.println(pi);
        long endTime = System.currentTimeMillis();
        System.out.println("Solution took "+(endTime - startTime) + " ms");
    }
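    public static boolean [] simpleSieve (int limit) // parameter renamed from 'l' so it no longer shadows the static gap count 'l'
    {
        long sqrtl = (long)Math.sqrt(limit);
        boolean [] primes = new boolean [limit/2+2]; // the odd number i is stored at index (i + 1) / 2
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp ; i <= sqrtl ; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i ; j <= limit ; j += i * 2)
                    primes[(j + 1) / 2]=false;
            g++;
            if (g == l) // wrap around at the end of the gaps array
                g=0;
        }
        return primes;
    }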
    public static long pisqrtx ()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length-1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp ; i <= sqrtx ; i += gaps[g])
        {
            if(sievingPrimes[( i + 1 ) / 2])
                pi++;
            g++;
            if (g == l)
                g=0;
        }
        return pi;
    }
    public static void preCalc ()
    {
        sqrtx = (int) Math.sqrt(x);
        int prod = 1;
        for (int p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length-1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
        int lim = prod + nextp; // one full turn of the wheel, starting at nextp
        boolean [] marks = new boolean [lim + 1];
        Arrays.fill(marks, true);
        for (int j = 2 * 2 ;j <= lim ; j += 2)
            marks[j] = false;
        for (int i = 1 ; i < wheels.length ; i++)
        {
            int p = wheels[i];
            for (int j = p * p ; j <= lim ; j += 2 * p)
                marks[j]=false; // removing all integers that are NOT coprime with the base wheel primes
        }
        ArrayList <Integer> gs = new ArrayList <Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2 ; p < marks.length ; p += 2)
        {
            if (marks[p]) //d is coprime with the wheel primes. if p is as well, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int [gs.size()];
        for (int i = 0 ; i < gs.size() ; i++)
            gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;
        sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
    }
}
Currently, it produces the 50847534 primes below 10^9 in about 1.6 seconds. This is very impressive, at least by my standards, but I am looking to make it faster and possibly break the 1 second barrier. Even then, I believe it can be made much faster still.
The whole program is based on wheel factorization: https://en.wikipedia.org/wiki/Wheel_factorization. I have noticed that I get the fastest results using a wheel of all primes up to 19.
public static int [] wheels = new int [] {2,3,5,7,11,13,17,19}; // base wheel primes
This means that the multiples of those primes are skipped, resulting in a much smaller search range. The gaps between the numbers we need to visit are then calculated in the preCalc method. If we make those jumps between the numbers in the search range, we skip the multiples of the base primes.
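To make the gap idea concrete, here is a small sketch that computes the gaps for the much smaller wheel {2, 3, 5}. It tests coprimality by trial division instead of the sieve-style marking used in preCalc, and the names WheelGapsDemo, wheelGaps and isCoprime are invented for this example. The start value is assumed to be coprime to all base primes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WheelGapsDemo {
    /**
     * Gaps between consecutive integers coprime to every base prime,
     * over one full turn of the wheel beginning at 'start'.
     */
    static int[] wheelGaps(int[] basePrimes, int start) {
        int circumference = 1;
        for (int p : basePrimes) circumference *= p; // product of the base primes
        List<Integer> gaps = new ArrayList<>();
        int prev = start;
        for (int n = start + 1; n <= start + circumference; n++) {
            if (isCoprime(n, basePrimes)) {
                gaps.add(n - prev); // distance from the previous candidate
                prev = n;
            }
        }
        int[] out = new int[gaps.size()];
        for (int i = 0; i < out.length; i++) out[i] = gaps.get(i);
        return out;
    }

    /** True if none of the base primes divides n. */
    static boolean isCoprime(int n, int[] basePrimes) {
        for (int p : basePrimes) if (n % p == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(wheelGaps(new int[] {2, 3, 5}, 7)));
    }
}
```

Starting from 7, the candidates are 7, 11, 13, 17, 19, 23, 29, 31, 37, so the gaps are 4, 2, 4, 2, 4, 6, 2, 6. They sum to the wheel circumference 30, after which the pattern repeats.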
    public static void preCalc ()
    {
        sqrtx = (int) Math.sqrt(x);
        int prod = 1;
        for (int p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length-1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
        int lim = prod + nextp; // one full turn of the wheel, starting at nextp
        boolean [] marks = new boolean [lim + 1];
        Arrays.fill(marks, true);
        for (int j = 2 * 2 ;j <= lim ; j += 2)
            marks[j] = false;
        for (int i = 1 ; i < wheels.length ; i++)
        {
            int p = wheels[i];
            for (int j = p * p ; j <= lim ; j += 2 * p)
                marks[j]=false; // removing all integers that are NOT coprime with the base wheel primes
        }
        ArrayList <Integer> gs = new ArrayList <Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2 ; p < marks.length ; p += 2)
        {
            if (marks[p]) //d is coprime with the wheel primes. if p is as well, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int [gs.size()];
        for (int i = 0 ; i < gs.size() ; i++)
            gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;
        sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
    }
At the end of the preCalc method, the simpleSieve method is called, efficiently sieving all the sieving primes mentioned before, the primes <= sqrtx. This is a simple Eratosthenes sieve, rather than a segmented one, but it is still based on the wheel factorization computed previously.
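    public static boolean [] simpleSieve (int limit) // parameter renamed from 'l' so it no longer shadows the static gap count 'l'
    {
        long sqrtl = (long)Math.sqrt(limit);
        boolean [] primes = new boolean [limit/2+2]; // the odd number i is stored at index (i + 1) / 2
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp ; i <= sqrtl ; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i ; j <= limit ; j += i * 2)
                    primes[(j + 1) / 2]=false;
            g++;
            if (g == l) // wrap around at the end of the gaps array
                g=0;
        }
        return primes;
    }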
Finally, we reach the heart of the algorithm. We start by enumerating all primes <= sqrtx, with the following call:
    long pi = pisqrtx();
which uses the following method:
    public static long pisqrtx ()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length-1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp ; i <= sqrtx ; i += gaps[g])
        {
            if(sievingPrimes[( i + 1 ) / 2])
                pi++;
            g++;
            if (g == l)
                g=0;
        }
        return pi;
    }
Then, after initializing the pi variable, which keeps track of the prime count, we perform the segmentation mentioned above, starting the enumeration from the first prime > sqrtx:
    int segSize = Math.max(sqrtx, 32768*8); //size of each segment
    long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
    int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes
    long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx
    for (long low = 0 ; low < x ; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
    {
        long high = Math.min(x, low + segSize);
        boolean [] segment = new boolean [(int) (high - low + 1)];
        int g = -1;
        for (int i = nextp ; i <= sqrtx ; i += gaps[g])
        {
            if (sievingPrimes[(i + 1) / 2])
            {
                long firstMultiple = (long) (low / i * i);
                if (firstMultiple < low)
                    firstMultiple += i;
                if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                    firstMultiple += i;
                for (long j = firstMultiple ; j <= high ; j += i * 2) // <= high: the slot for 'high' exists and is read below when u == high (matters for an odd limit)
                    segment[(int) (j - low)] = true;
            }
            g++;
            //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment** , and therefore this check is redundant.
            //should be used with bigger segment sizes or smaller lists of gaps
            //g = 0;
        }
        while (u <= high)
        {
            if (!segment[(int) (u - low)])
                pi++;
            u += gaps[wh];
            wh++;
            if (wh == l)
                wh = 0;
        }
    }
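The start-of-segment arithmetic in the loop above can be isolated into a small helper. This is a minimal sketch: the names FirstMultipleDemo and firstOddMultiple are made up for illustration, and p is assumed to be an odd prime.

```java
final class FirstMultipleDemo {
    /** Smallest odd multiple of the odd prime p that is >= low. */
    static long firstOddMultiple(long low, long p) {
        long m = ((low + p - 1) / p) * p; // round low up to a multiple of p
        if ((m & 1) == 0) {
            m += p; // an even multiple plus the odd p gives the next, odd multiple
        }
        return m;
    }

    public static void main(String[] args) {
        // First odd multiple of 23 in a segment that begins at 2^18.
        System.out.println(firstOddMultiple(1L << 18, 23));
    }
}
```

For low = 0 the helper returns p itself, which matches the loop above: in the first segment each sieving prime marks its own slot, and those primes are counted separately by pisqrtx.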
I have also included this as a note in the code, but will explain it here as well. Because the segment size is relatively small, we will not go through the entire list of gaps within just one segment, so checking for it is redundant (assuming we use a 19-wheel). But over the program as a whole we do make use of the entire array of gaps, so the variable u has to follow it without accidentally running past its end:
    while (u <= high)
    {
        if (!segment[(int) (u - low)])
            pi++;
        u += gaps[wh];
        wh++;
        if (wh == l)
            wh = 0;
    }
Using higher limits will eventually produce a bigger segment, which might make it necessary to check that we do not run past the gaps list even within a segment. This, or tweaking the wheel primes base, might have this effect on the program. Switching to bit-sieving can greatly raise the segment limit, though.
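To illustrate what bit-sieving means here, the following is a minimal sketch of an odd-only sieve that packs 64 candidates into each long, so a candidate costs one bit instead of a whole boolean array element (typically one byte on common JVMs). It is a plain, unsegmented sieve without a wheel, and the names BitSieveDemo and countPrimes are assumptions for this example, not part of the program above.

```java
final class BitSieveDemo {
    /** Counts primes <= n using one bit per odd number; bit i stands for 2*i + 1. */
    static int countPrimes(int n) {
        if (n < 2) return 0;
        int count = (int) (((long) n + 1) / 2);          // odd numbers 1, 3, 5, ... <= n
        long[] composite = new long[(count + 63) >>> 6]; // 64 candidates per word
        composite[0] |= 1L;                              // 1 is not prime
        for (long p = 3; p * p <= n; p += 2) {
            int pi = (int) (p >> 1);                     // bit index of p
            if ((composite[pi >>> 6] & (1L << (pi & 63))) != 0) continue;
            for (long m = p * p; m <= n; m += 2 * p) {   // odd multiples only
                int mi = (int) (m >> 1);
                composite[mi >>> 6] |= 1L << (mi & 63);
            }
        }
        int marked = 0;
        for (long word : composite) marked += Long.bitCount(word);
        return 1 + count - marked;                       // +1 for the prime 2
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000));
    }
}
```

The same packing could be applied to the per-segment array, so that a segment of a given byte size covers 16 times as many integers as a boolean array indexed by every integer.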
The segment size takes the L1 & L2 cache sizes into account. I get the fastest results using a segment size of 32,768 * 8 = 262,144 = 2^18. I am not sure what the cache size of my computer is, but I do not think it can be that big, as most cache sizes I see are <= 32,768. Still, this produces the fastest run time on my computer, so this is why it is the chosen segment size.
I would like to parallelize this using 4 threads (corresponding to 4 cores). The idea is that each thread will still use the idea of the segmented sieve, but work on different portions. Divide the n into 4 equal portions, one per thread, with each thread in turn performing the segmentation on the n/4 elements it is responsible for, using the above program. My question is: how do I do that? Additionally, I would like to hear about more ways of speeding up this program; any ideas you have, I would love to hear. I really want to make it very fast and efficient. Thank you!
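One way to read the proposed split is sketched below, under the assumption that portion boundaries should fall on segment boundaries so that each thread can run the existing segment loop over its own range. The class name PortionSplitDemo, the helper boundaries and its parameters are hypothetical.

```java
final class PortionSplitDemo {
    /**
     * Returns threads + 1 boundaries b[0..threads]; thread t would handle
     * the range from b[t] to b[t + 1]. Every inner boundary is a multiple
     * of segmentSize, and the last one is clamped to the limit.
     */
    static long[] boundaries(long limit, int threads, int segmentSize) {
        long segments = (limit + segmentSize - 1) / segmentSize; // segment count, rounded up
        long[] b = new long[threads + 1];
        for (int t = 0; t <= threads; t++) {
            long firstSegment = segments * t / threads;          // segments given to threads 0..t-1
            b[t] = Math.min(limit, firstSegment * segmentSize);
        }
        return b;
    }

    public static void main(String[] args) {
        for (long boundary : boundaries(1_000_000_000L, 4, 1 << 18)) {
            System.out.println(boundary);
        }
    }
}
```

With such a split, each thread would need its own starting values for u and wh (the first wheel candidate at or after its lower boundary and the matching gap index) rather than continuing from the previous portion, and its own counter that is added to the total at the end.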
An example like this should help you get started.
An outline of a solution:
Extra speedup may (or may not) be achieved by joining the results in a separate task that reads the output queue, or even by updating a mutable shared output structure under synchronized, depending on how much work the joining step involves.
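As a rough illustration of the general idea of handing independent segments to a pool of worker threads and joining their results, here is a minimal sketch. It is not the answerer's code and not a drop-in replacement for the program in the question: it uses a plain boolean[] per segment without the wheel, and the names ParallelSegmentedSieve, countPrimes, countSegment and smallPrimes are assumptions made for this example. It relies only on the standard java.util.concurrent types ExecutorService, Executors, Callable and Future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSegmentedSieve {
    static final int SEGMENT_SIZE = 1 << 18; // same 2^18 segment size as in the question

    /** Counts primes <= limit by sieving independent segments on a fixed thread pool. */
    static long countPrimes(long limit, int threads) {
        if (limit < 2) return 0;
        final int[] sievingPrimes = smallPrimes((int) Math.sqrt((double) limit) + 1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (long low = 2; low <= limit; low += SEGMENT_SIZE) {
                final long lo = low;
                final long hi = Math.min(limit, low + SEGMENT_SIZE - 1); // inclusive
                Callable<Long> task = () -> countSegment(lo, hi, sievingPrimes);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) total += part.get(); // the joining step
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("sieving task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Sieves [low, high] (both inclusive) and returns how many primes it contains. */
    static long countSegment(long low, long high, int[] sievingPrimes) {
        boolean[] composite = new boolean[(int) (high - low + 1)];
        for (int p : sievingPrimes) {
            long square = (long) p * p;
            if (square > high) break;
            // First multiple of p inside the segment, but never p itself.
            long start = Math.max(square, ((low + p - 1) / p) * p);
            for (long m = start; m <= high; m += p) composite[(int) (m - low)] = true;
        }
        long count = 0;
        for (boolean c : composite) if (!c) count++;
        return count;
    }

    /** All primes <= n, found with a simple sieve; requires n >= 2. */
    static int[] smallPrimes(int n) {
        boolean[] composite = new boolean[n + 1];
        int count = 0;
        for (int i = 2; i <= n; i++) {
            if (composite[i]) continue;
            count++;
            for (long m = (long) i * i; m <= n; m += i) composite[(int) m] = true;
        }
        int[] primes = new int[count];
        for (int i = 2, k = 0; i <= n; i++) if (!composite[i]) primes[k++] = i;
        return primes;
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(countPrimes(1_000_000_000L, threads));
    }
}
```

Each task only reads the shared sievingPrimes array and writes to its own segment array, so no locking is needed while sieving. Because the joining step here is just a sum of per-segment counts, collecting the futures in the submitting thread is probably sufficient; a separate joining task or a synchronized shared structure would matter more if the primes themselves had to be collected.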
Hope this helps.
Are you familiar with the work of Tomas Oliveira e Silva? He has a very fast implementation of the Sieve of Eratosthenes.
How interested in speed are you? Would you consider using C++?
$ time ../c_code/segmented_bit_sieve 1000000000
50847534 primes found.
real 0m0.875s
user 0m0.813s
sys 0m0.016s
$ time ../c_code/segmented_bit_isprime 1000000000
50847534 primes found.
real 0m0.816s
user 0m0.797s
sys 0m0.000s
(on my newish laptop with an i5)
The first is from @Kim Walisch, using a bit array of odd prime candidates.
https://github.com/kimwalisch/primesieve/wiki/Segmented-sieve-of-Eratosthenes
The second is my tweak to Kim's, with IsPrime[] also implemented as a bit array, which is slightly less clear to read, although a little faster for big N due to the reduced memory footprint.
I will read your post carefully, as I am interested in primes and performance no matter what language is used. I hope this isn't too far off topic or premature. But I noticed I was already beyond your performance goal.