Multithreaded segmented sieve of Eratosthenes in Java

Question

I am trying to create a fast prime generator in Java. It is (more or less) accepted that the fastest way to do this is the segmented sieve of Eratosthenes: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes. Many further optimizations can be implemented to make it faster. As of now, my implementation counts the 50847534 primes below 10^9 in about 1.6 seconds, but I am looking to make it faster and at least break the 1 second barrier. To increase the chance of getting good replies, I will include a walkthrough of the algorithm as well as the code.

Still, as a TL;DR: I am looking to add multithreading to the code.

For the purposes of this question, I want to distinguish between the 'segmented' and the 'traditional' sieves of Eratosthenes. The traditional sieve requires O(n) space and is therefore very limited in the range of its input (the limit). The segmented sieve, however, only requires O(n^0.5) space and can operate on much larger limits. (A main speed-up comes from cache-friendly segmentation, taking into account the L1 & L2 cache sizes of the specific computer.) Finally, the main difference that concerns my question is that the traditional sieve is sequential, meaning each step can only continue once the previous steps are completed. The segmented sieve is not. Each segment is independent and is 'processed' individually against the sieving primes (the primes not larger than n^0.5). This means that, theoretically, once I have the sieving primes, I can divide the work between multiple computers, each processing a different segment. The work of each one is independent of the others. Assuming (wrongly) that each segment requires the same amount of time t to complete, and that there are k segments, one computer would require a total time of T = k * t, whereas k computers, each working on a different segment, would require a total time of T = t to complete the entire process. (In practice this is wrong, but it keeps the example simple.)

This brought me to reading about multithreading: dividing the work among a few threads, each processing a smaller amount of work, for better usage of the CPU. To my understanding, the traditional sieve cannot be multithreaded precisely because it is sequential. Each thread would depend on the previous one, rendering the entire idea unfeasible. But a segmented sieve may indeed (I think) be multithreaded.

Instead of jumping straight into my question, I think it is important to introduce my code first, so I am including my current fastest implementation of the segmented sieve. I have worked quite hard on it. It took quite some time, slowly tweaking it and adding optimizations. The code is not simple; it is rather complex, I would say. I therefore assume the reader is familiar with the concepts I am using, such as wheel factorization, prime numbers, segmentation and more. I have included comments to make it easier to follow.

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;

public class primeGen {

    public static long x = (long)Math.pow(10, 9); //limit
    public static int sqrtx;
    public static boolean [] sievingPrimes; //the sieving primes, <= sqrtx

    public static int [] wheels = new int [] {2,3,5,7,11,13,17,19}; // base wheel primes
    public static int [] gaps; //the gaps, according to the wheel. will enable skipping multiples of the wheel primes
    public static int nextp; // the first prime > wheel primes
    public static int l; // the amount of gaps in the wheel

    public static void main(String[] args)
    {
        long startTime = System.currentTimeMillis();

        preCalc();  // creating the sieving primes and calculating the list of gaps

        int segSize = Math.max(sqrtx, 32768*8); //size of each segment
        long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
        int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes

        long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx

        for (long low = 0 ; low < x ; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
        {
            long high = Math.min(x, low + segSize);
            boolean [] segment = new boolean [(int) (high - low + 1)];

            int g = -1;
            for (int i = nextp ; i <= sqrtx ; i += gaps[g])
            { 
                if (sievingPrimes[(i + 1) / 2])
                {
                    long firstMultiple = (long) (low / i * i);
                    if (firstMultiple < low) 
                        firstMultiple += i; 
                    if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                        firstMultiple += i;

                    for (long j = firstMultiple ; j <= high ; j += i * 2) // include 'high' itself, because the counting loop below also visits u == high
                        segment[(int) (j - low)] = true; 
                }
                g++;
                //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment** , and therefore this check is redundant. 
                              //should be used with bigger segment sizes or smaller lists of gaps
                    //g = 0;
            }

            while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }
        }

        System.out.println(pi);

        long endTime = System.currentTimeMillis();
        System.out.println("Solution took "+(endTime - startTime) + " ms");
    }

    public static boolean [] simpleSieve (int limit) // 'limit' is kept distinct from the static gap count 'l' used below
    {
        long sqrtLimit = (long)Math.sqrt(limit);
        boolean [] primes = new boolean [limit/2+2];
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp ; i <= sqrtLimit ; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i ; j <= limit ; j += i * 2)
                    primes[(j + 1) / 2]=false;
            g++;
            if (g == l) // wrap around at the end of the gaps array
                g=0;
        }
        return primes;
    }

    public static long pisqrtx ()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length-1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp ; i <= sqrtx ; i += gaps[g])
        {
            if(sievingPrimes[( i + 1 ) / 2])
                pi++;
            g++;
            if (g == l)
                g=0;
        }

        return pi;
    }

    public static void preCalc ()
    {
        sqrtx = (int) Math.sqrt(x);

        int prod = 1;
        for (long p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length-1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
        int lim = prod + nextp; // circumference of the wheel

        boolean [] marks = new boolean [lim + 1];
        Arrays.fill(marks, true);

        for (int j = 2 * 2 ;j <= lim ; j += 2)
            marks[j] = false;
        for (int i = 1 ; i < wheels.length ; i++)
        {
            int p = wheels[i];
            for (int j = p * p ; j <= lim ; j += 2 * p)
                marks[j]=false;   // removing all integers that are NOT coprime with the base wheel primes
        }
        ArrayList <Integer> gs = new ArrayList <Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2 ; p < marks.length ; p += 2)
        {
            if (marks[p]) //d is prime. if p is also prime, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int [gs.size()];
        for (int i = 0 ; i < gs.size() ; i++)
            gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;

        sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
    }

}

Currently, it counts the 50847534 primes below 10^9 in about 1.6 seconds. This is very impressive, at least by my standards, but I am looking to make it faster, possibly breaking the 1 second barrier. Even then, I believe it can be made much faster still.

The whole program is based on wheel factorization: https://en.wikipedia.org/wiki/Wheel_factorization. I have noticed I am getting the fastest results using a wheel of all primes up to 19.

public static int [] wheels = new int [] {2,3,5,7,11,13,17,19}; // base wheel primes

This means that the multiples of those primes are skipped, resulting in a much smaller search range. The gaps between the numbers we need to visit are then calculated in the preCalc method. If we make those jumps between the numbers in the search range, we skip the multiples of the base primes.

public static void preCalc ()
    {
        sqrtx = (int) Math.sqrt(x);

        int prod = 1;
        for (long p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length-1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
        int lim = prod + nextp; // circumference of the wheel

        boolean [] marks = new boolean [lim + 1];
        Arrays.fill(marks, true);

        for (int j = 2 * 2 ;j <= lim ; j += 2)
            marks[j] = false;
        for (int i = 1 ; i < wheels.length ; i++)
        {
            int p = wheels[i];
            for (int j = p * p ; j <= lim ; j += 2 * p)
                marks[j]=false;   // removing all integers that are NOT coprime with the base wheel primes
        }
        ArrayList <Integer> gs = new ArrayList <Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2 ; p < marks.length ; p += 2)
        {
            if (marks[p]) //d is prime. if p is also prime, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int [gs.size()];
        for (int i = 0 ; i < gs.size() ; i++)
            gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;

        sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
    }
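
To make the gap idea concrete, here is a minimal sketch on a much smaller wheel, {2, 3, 5}, instead of the 19-wheel used in the program. The names WheelGapsDemo, wheelGaps and isCoprime are illustrative and not part of the original code, and the sketch uses plain trial division rather than the marks array from preCalc:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WheelGapsDemo {

    /**
     * Gaps between consecutive integers coprime to every wheel prime,
     * starting at 'start' (assumed coprime itself) and covering one full turn.
     */
    static int[] wheelGaps(int[] wheelPrimes, int start) {
        int circumference = 1;
        for (int p : wheelPrimes)
            circumference *= p;                     // 2 * 3 * 5 = 30 for the demo wheel
        List<Integer> gaps = new ArrayList<>();
        int prev = start;
        for (int n = start + 1; n <= start + circumference; n++) {
            if (isCoprime(n, wheelPrimes)) {
                gaps.add(n - prev);                 // distance to the previous candidate
                prev = n;
            }
        }
        return gaps.stream().mapToInt(Integer::intValue).toArray();
    }

    /** True if n is divisible by none of the wheel primes. */
    static boolean isCoprime(int n, int[] wheelPrimes) {
        for (int p : wheelPrimes)
            if (n % p == 0)
                return false;
        return true;
    }

    public static void main(String[] args) {
        // 7 is the first prime after the wheel primes 2, 3, 5
        System.out.println(Arrays.toString(wheelGaps(new int[] {2, 3, 5}, 7)));
    }
}
```

For this wheel it prints [4, 2, 4, 2, 4, 6, 2, 6]: eight gaps, one per residue coprime to 30, which sum to the circumference 30, so the same pattern repeats from 37 onward.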

At the end of the preCalc method, the simpleSieve method is called, efficiently sieving all the sieving primes mentioned before, i.e. the primes <= sqrtx. This is a simple Eratosthenes sieve rather than a segmented one, but it is still based on the wheel factorization computed previously.

 public static boolean [] simpleSieve (int limit) // 'limit' is kept distinct from the static gap count 'l' used below
    {
        long sqrtLimit = (long)Math.sqrt(limit);
        boolean [] primes = new boolean [limit/2+2];
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp ; i <= sqrtLimit ; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i ; j <= limit ; j += i * 2)
                    primes[(j + 1) / 2]=false;
            g++;
            if (g == l) // wrap around at the end of the gaps array
                g=0;
        }
        return primes;
    }

Finally, we reach the heart of the algorithm. We start by counting all primes <= sqrtx, with the following call:

 long pi = pisqrtx();

which uses the following method:

public static long pisqrtx ()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length-1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp ; i <= sqrtx ; i += gaps[g])
        {
            if(sievingPrimes[( i + 1 ) / 2])
                pi++;
            g++;
            if (g == l)
                g=0;
        }

        return pi;
    }

Then, after initializing the pi variable, which keeps track of the prime count, we perform the segmentation mentioned above, starting the enumeration from the first prime > sqrtx:

 int segSize = Math.max(sqrtx, 32768*8); //size of each segment
        long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
        int wh = 0; // this will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes

        long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx

        for (long low = 0 ; low < x ; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
        {
            long high = Math.min(x, low + segSize);
            boolean [] segment = new boolean [(int) (high - low + 1)];

            int g = -1;
            for (int i = nextp ; i <= sqrtx ; i += gaps[g])
            { 
                if (sievingPrimes[(i + 1) / 2])
                {
                    long firstMultiple = (long) (low / i * i);
                    if (firstMultiple < low) 
                        firstMultiple += i; 
                    if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                        firstMultiple += i;

                    for (long j = firstMultiple ; j <= high ; j += i * 2) // include 'high' itself, because the counting loop below also visits u == high
                        segment[(int) (j - low)] = true; 
                }
                g++;
                //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment** , and therefore this check is redundant. 
                              //should be used with bigger segment sizes or smaller lists of gaps
                    //g = 0;
            }

            while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }
        }

I have also included this as a comment in the code, but will explain it here as well. Because the segment size is relatively small, we never go through the entire list of gaps within a single segment, so checking for wrap-around there is redundant (assuming we use a 19-wheel). Over the whole run of the program, however, we do make use of the entire array of gaps, so the index wh that drives the variable u has to wrap around and not accidentally run past the end of the array:

 while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }

Using higher limits will eventually require a bigger segment, which might make it necessary to check that we don't run past the gaps list even within a segment. This, or tweaking the wheel primes base, might have this effect on the program. Switching to bit-sieving can largely improve the segment limit, though.

As an important side note, I am aware that efficient segmentation takes the L1 & L2 cache sizes into account. I get the fastest results using a segment size of 32,768 * 8 = 262,144 = 2^18. I am not sure what the cache size of my computer is, but I do not think it can be that big, as most cache sizes I see are <= 32,768. Still, this produces the fastest run time on my computer, which is why it is the chosen segment size.

As I mentioned, I am still looking to improve this by a lot. I believe, according to my introduction, that multithreading can result in a speed-up factor of 4, using 4 threads (corresponding to 4 cores). The idea is that each thread will still use the segmented sieve, but work on a different portion. Divide n into 4 equal portions, one per thread, with each thread performing the segmentation on the n/4 elements it is responsible for, using the above program. My question is: how do I do that? Reading about multithreading and examples, unfortunately, did not give me any insight into how to implement it efficiently in the case above. When I tried, it seemed to me that, contrary to the logic behind it, the threads were running sequentially rather than simultaneously. This is why I excluded it from the code, to make it more readable. I would really appreciate a code sample showing how to do it in this specific code, but a good explanation and reference might do the trick too.

Additionally, I would love to hear any other ideas you have for speeding up this program even further. I really want to make it very fast and efficient. Thank you!

Answer 1

An example like this should help you get started.

An outline of a solution:

1. Define a data structure ("Task") that encompasses a specific segment; you can put all the immutable shared data into it for extra neatness, too. If you're careful enough, you can pass a common mutable array to all tasks, along with the segment limits, and only update the part of the array within these limits. This is more error-prone, but can simplify the step of joining the results (AFAICT; YMMV).
2. Define a data structure ("Result") that stores the result of a Task computation. Even if you just update a shared resulting structure, you may need to signal which part of that structure has been updated so far.
3. Create a Runnable that accepts a Task, runs a computation, and puts the results into a given result queue.
4. Create a blocking input queue for Tasks, and a queue for Results.
5. Create a ThreadPoolExecutor with the number of threads close to the number of machine cores.
6. Submit all your Tasks to the thread pool executor. They will be scheduled to run on the threads from the pool, and will put their results into the output queue, not necessarily in order.
7. Wait for all the tasks in the thread pool to finish.
8. Drain the output queue and join the partial results into the final result.

Extra speedup may (or may not) be achieved by joining the results in a separate task that reads the output queue, or even by updating a mutable shared output structure under synchronized, depending on how much work the joining step involves.
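
The outline above could be sketched roughly as follows. This is a simplified illustration, not a drop-in change to the question's code: it assumes a plain sieve without the wheel, it uses the Future<Long> values returned by the executor in place of an explicit result queue, and the names ParallelSegmentedSieve, smallPrimes, countSegment and countPrimes are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSegmentedSieve {

    /** Plain sieve of Eratosthenes; returns all primes <= limit. */
    static int[] smallPrimes(int limit) {
        boolean[] composite = new boolean[limit + 1];
        List<Integer> primes = new ArrayList<>();
        for (int i = 2; i <= limit; i++) {
            if (!composite[i]) {
                primes.add(i);
                for (long j = (long) i * i; j <= limit; j += i)
                    composite[(int) j] = true;
            }
        }
        return primes.stream().mapToInt(Integer::intValue).toArray();
    }

    /** The "Task": counts primes in [low, high) using only the read-only sieving primes. */
    static long countSegment(long low, long high, int[] sievingPrimes) {
        boolean[] composite = new boolean[(int) (high - low)];   // private to this task
        for (int p : sievingPrimes) {
            long pSquared = (long) p * p;
            if (pSquared >= high)
                break;                                           // larger primes cannot mark anything here
            long first = Math.max(pSquared, (low + p - 1) / p * p);
            for (long j = first; j < high; j += p)
                composite[(int) (j - low)] = true;
        }
        long count = 0;
        for (long n = Math.max(low, 2); n < high; n++)
            if (!composite[(int) (n - low)])
                count++;
        return count;
    }

    /** Counts primes <= x by sieving independent segments on a fixed-size thread pool. */
    static long countPrimes(long x, int segmentSize, int threads) {
        if (x < 2)
            return 0;
        int[] sievingPrimes = smallPrimes((int) Math.sqrt((double) x) + 1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long low = 0; low <= x; low += segmentSize) {
                final long lo = low;
                final long hi = Math.min(x + 1, low + segmentSize);
                Callable<Long> task = () -> countSegment(lo, hi, sievingPrimes);
                results.add(pool.submit(task));                  // segments may run in any order
            }
            long total = 0;
            for (Future<Long> f : results) {
                try {
                    total += f.get();                            // blocks until that segment is done
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(countPrimes(1_000_000L, 1 << 15, threads));
    }
}
```

Each task only reads the shared sievingPrimes array and writes to its own buffer, so no synchronization is needed; main should print 78498, the number of primes below 10^6. Allocating a fresh buffer per segment keeps the sketch short, and reusing one buffer per thread would probably be faster.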

Hope this helps.

Answer 2

Are you familiar with the work of Tomas Oliveira e Silva? He has a very fast implementation of the Sieve of Eratosthenes.

Answer 3

How interested in speed are you? Would you consider using C++?

$ time ../c_code/segmented_bit_sieve 1000000000
50847534 primes found.

real    0m0.875s
user    0m0.813s
sys     0m0.016s
$ time ../c_code/segmented_bit_isprime 1000000000
50847534 primes found.

real    0m0.816s
user    0m0.797s
sys     0m0.000s

(on my newish laptop with an i5)

The first is from @Kim Walisch, using a bit array of odd prime candidates.

https://github.com/kimwalisch/primesieve/wiki/Segmented-sieve-of-Eratosthenes

The second is my tweak to Kim's, with IsPrime[] also implemented as a bit array, which is slightly less clear to read, although a little faster for big N due to the reduced memory footprint.
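
For readers who want to stay in Java, the odd-only bit array idea might look like the sketch below. It is an illustration under stated assumptions rather than a port of the linked C++ code: OddBitSegmentDemo, oddPrimes and countSegment are hypothetical names, and low is assumed to be even.

```java
import java.util.ArrayList;
import java.util.List;

class OddBitSegmentDemo {

    /** Odd primes <= limit from a plain sieve (the input is small, so simplicity wins here). */
    static int[] oddPrimes(int limit) {
        boolean[] composite = new boolean[limit + 1];
        List<Integer> primes = new ArrayList<>();
        for (int i = 3; i <= limit; i += 2) {
            if (!composite[i]) {
                primes.add(i);
                for (long j = (long) i * i; j <= limit; j += 2L * i)
                    composite[(int) j] = true;
            }
        }
        return primes.stream().mapToInt(Integer::intValue).toArray();
    }

    /**
     * Counts primes in [low, high), where low is assumed to be even.
     * Bit k stands for the odd number low + 2k + 1; a set bit means "composite".
     */
    static long countSegment(long low, long high, int[] oddPrimes) {
        int odds = (int) ((high - low) / 2);            // how many odd numbers lie in [low, high)
        long[] bits = new long[(odds + 63) >>> 6];      // 64 odd candidates per long
        for (int p : oddPrimes) {
            long pSquared = (long) p * p;
            if (pSquared >= high)
                break;
            long first = Math.max(pSquared, (low + p - 1) / p * p);
            if ((first & 1) == 0)
                first += p;                             // move on to an odd multiple
            for (long j = first; j < high; j += 2L * p) {
                int k = (int) ((j - low) >> 1);
                bits[k >>> 6] |= 1L << (k & 63);
            }
        }
        long count = (low <= 2 && 2 < high) ? 1 : 0;    // 2 is the only even prime
        for (int k = 0; k < odds; k++) {
            long n = low + 2L * k + 1;
            if (n > 1 && (bits[k >>> 6] & (1L << (k & 63))) == 0)
                count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long x = 1_000_000L;
        int segmentSize = 1 << 18;                      // numbers per segment, so 2^17 bits per segment
        int[] primes = oddPrimes((int) Math.sqrt((double) x) + 1);
        long total = 0;
        for (long low = 0; low <= x; low += segmentSize)
            total += countSegment(low, Math.min(x + 1, low + segmentSize), primes);
        System.out.println(total);
    }
}
```

With 2^18 numbers per segment this stores 2^17 bits, i.e. 16 KiB, whereas a boolean[] covering the same range needs 2^18 elements (typically one byte each on common JVMs). The main method should print 78498 for 10^6.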

I will read your post carefully, as I am interested in primes and performance no matter what language is used. I hope this isn't too far off topic or premature. But I noticed I was already beyond your performance goal.

3 answers

Answer 1
1 2019-07-25 14:54:54

Answer 2
1 2019-07-25 16:59:08

Answer 3
0 2019-08-22 16:45:04
