
C++ Optimizing this Algorithm

After watching some Terence Tao videos, I wanted to try implementing an algorithm in C++ to find all the prime numbers up to a number n. In my first version, which simply tested every integer from 2 to n for divisibility by everything from 2 to sqrt(n), the program found the primes between 1 and 10,000,000 in ~52 seconds.

Attempting to optimize the program, and implementing what I now know to be the Sieve of Eratosthenes, I assumed the task would be done much faster than ~52 seconds, but sadly that wasn't the case. Even going only up to 1,000,000 took a considerable amount of time (I didn't time it, though).

#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<int> tosieve = {};
    for (int i = 2; i < 1000001; i++)
    {
        tosieve.push_back(i);
    }
    for (int j = 0; j < tosieve.size(); j++)
    {
        for (int k = j + 1; k < tosieve.size(); k++)
        {
            if (tosieve[k] % tosieve[j] == 0)
            {
                tosieve.erase(tosieve.begin() + k);
            }
        }
    }
    //for (int f = 0; f < tosieve.size(); f++)
    //{
    //  cout << (tosieve[f]) << endl;
    //}
    cout << (tosieve.size()) << endl;
    system("pause");
}

Is it the repeated referencing of the vector or something? Why is this so slow? Even if I'm completely overlooking something (quite possible, as I'm a complete beginner at this), I would think that finding the primes between 2 and 1,000,000 with this horribly inefficient method would still be faster than my original way of finding them from 2 to 10,000,000.

Hope someone has a clear answer to this - hopefully I can use whatever knowledge I glean here in the future when optimizing programs that use a lot of recursion.

The problem is that erase moves every element after the erased one down by one position, which makes each call an O(n) operation.

There are three alternatives:

1) Just mark deleted elements as 'empty' (make them 0, for example). This will mean future passes have to pass over those empty positions, but that isn't that expensive.

2) Make a new vector, and push_back new values into there.

3) Use std::remove_if: this will still move elements down, but it does so in a single pass, so it is much more efficient. Remember that std::remove_if doesn't resize the vector itself; you have to call erase on the iterator it returns (see the sketch right below).
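
For example, option 3 could look roughly like this (just a sketch using the standard erase-remove idiom, not code from the question):

#include <algorithm>
#include <vector>

// Remove every multiple of p (except p itself) from v.
// std::remove_if shifts the kept elements to the front in a single pass and
// returns the new logical end; the single erase() call then shrinks the
// vector, so the whole operation is O(n) instead of O(n) per erased element.
void remove_multiples(std::vector<int>& v, int p) {
    v.erase(std::remove_if(v.begin(), v.end(),
                           [p](int x) { return x != p && x % p == 0; }),
            v.end());
}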

vector::erase() has O(n) linear time complexity, because every element after the erased position has to be shifted down.

Since you have two nested loops over roughly 10^6 elements, and each erase() may itself touch up to 10^6 elements, your algorithm performs up to 10^18 elementary operations in the worst case.

Cubic algorithms for such a big N take a huge amount of time.
N = 10^6 is too big even for quadratic algorithms.

Please read carefully about the Sieve of Eratosthenes. The fact that the brute-force search and your "Sieve of Eratosthenes" took about the same time means that you have implemented the second one incorrectly; see the sketch below.
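
For reference, a correct textbook sieve does no division at all, only crossing out of multiples, which is what makes it fast. A minimal sketch (not the asker's code):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = 1000000;
    std::vector<char> is_composite(n + 1, 0);        // is_composite[i] != 0 => i is not prime
    for (std::size_t p = 2; p * p <= n; ++p)
        if (!is_composite[p])
            for (std::size_t j = p * p; j <= n; j += p)   // cross out multiples of p
                is_composite[j] = 1;

    std::size_t count = 0;
    for (std::size_t i = 2; i <= n; ++i)
        if (!is_composite[i]) ++count;
    std::cout << count << '\n';                       // 78498 primes up to 1,000,000
}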

I see two performance issues here:

First of all, push_back() has to reallocate the dynamic memory block once in a while. Use reserve():

vector<int> tosieve = {};
tosieve.reserve(1000001);
for (int i = 2; i < 1000001; i++) 
{                                       
    tosieve.push_back(i);               
}

Second, erase() has to move all elements behind the one you try to remove. Set the elements to 0 instead and do one pass over the vector at the end (untested code):

for (auto& x : tosieve) {
    for (auto y = tosieve.begin(); *y < x; ++y) // this check works only in
                                                // the case of an ordered vector
        if (*y != 0 && x % *y == 0) x = 0;      // dereference the iterator
}
{ // this block makes sure sieved is released afterwards
    auto sieved = vector<int>{};
    for (auto x : tosieve)
        if (x != 0)                             // keep only the unmarked (prime) values
            sieved.push_back(x);
    swap(tosieve, sieved);
} // the large memory block is released now, only the sieved elements are kept.

Also consider using standard algorithms instead of hand-written loops. They help you state your intent. In this case I would use std::transform() for the outer loop of the sieve, std::any_of() for the inner loop, std::generate_n() for filling tosieve at the beginning, and std::copy_if() for filling sieved (untested code):

vector<int> tosieve = {};
tosieve.reserve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
    static int i = 2; return i++;
});

transform(begin(tosieve), end(tosieve), begin(tosieve), [&tosieve](int i) -> int {
    return any_of(begin(tosieve), begin(tosieve) + i - 2,
                  [&i](int j) -> bool {
                      return j != 0 && i % j == 0;
                  }) ? 0 : i;
});
tosieve = [&tosieve]() -> vector<int> {
    auto sieved = vector<int>{};
    copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
            [](int i) -> bool { return i != 0; });
    return sieved;
}(); // move-assigning the result also releases the large old memory block

EDIT:

Yet another way to get that done:

vector<int> tosieve = {};
tosieve.reserve(1000001);
generate_n(back_inserter(tosieve), 1000001, []() -> int {
    static int i = 2; return i++;
});
tosieve = [&tosieve]() -> vector<int> {
    auto sieved = vector<int>{};
    copy_if(begin(tosieve), end(tosieve), back_inserter(sieved),
            [&tosieve](int i) -> bool {
                return !any_of(begin(tosieve), begin(tosieve) + i - 2,
                               [&i](int j) -> bool {
                                   return i % j == 0;
                               });
            });
    return sieved;
}();

Now, instead of marking the elements we don't want and copying afterwards, we directly copy only the elements we want to keep. This is not only faster than the suggestion above, but it also states the intent more clearly.

Very interesting task you have. Thanks!

With pleasure I implemented my own versions of solving it from scratch.

I created 3 separate (independent) functions, all based on the Sieve of Eratosthenes. These 3 versions differ in their complexity and speed.

Just a quick note: my simplest (slowest) version finds all primes below your desired limit of 10'000'000 within just 0.025 sec (i.e. 25 milliseconds).

I also tested all 3 versions to find primes below 2^32 (4'294'967'296), which the "simple" version solves within 47 seconds, the "intermediate" version within 30 seconds, and the "advanced" version within 12 seconds. So within just 12 seconds it finds all primes below 4 billion (there are 203'280'221 such primes below 2^32, see the OEIS sequence)!

For simplicity I will describe in detail only the Simple version of the 3. Here's the code:

#include <cstdint>
#include <vector>

using u8 = std::uint8_t; // assumed alias for the 8-bit storage type used below

template <typename T>
std::vector<T> GenPrimes_SieveOfEratosthenes(size_t end) {
    // https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
    if (end <= 2)
        return {};
    size_t const cnt = end >> 1;
    std::vector<u8> composites((cnt + 7) / 8);
    auto Get = [&](size_t i){ return bool((composites[i / 8] >> (i % 8)) & 1); };
    auto Set = [&](size_t i){ composites[i / 8] |= u8(1) << (i % 8); };
    std::vector<T> primes = {2};
    size_t i = 0;
    for (i = 1; i < cnt; ++i) {
        if (Get(i))
            continue;
        size_t const p = 2 * i + 1, start = (p * p) >> 1;
        primes.push_back(p);
        if (start >= cnt)
            break;
        for (size_t j = start; j < cnt; j += p)
            Set(j);
    }
    for (i = i + 1; i < cnt; ++i)
        if (!Get(i))
            primes.push_back(2 * i + 1);
    return primes;
}

This code implements the simplest but fast algorithm for finding primes, the Sieve of Eratosthenes. As a small optimization of speed and memory, I search only over odd numbers. This odd-numbers optimization halves both the memory and the number of steps, so it improves speed and memory consumption by exactly a factor of two.

The algorithm is simple: we allocate an array of bits where the bit at position K is 1 if the corresponding number is composite, or 0 if it is possibly prime. At the end, all 0 bits in the array signify definite primes (numbers that are certainly prime). Due to the odd-numbers optimization this bit array stores only odd numbers, so the K-th bit actually represents the number 2 * K + 1.

Then, going left to right over this array of bits, whenever we meet a 0 bit at position K we have found a prime number P = 2 * K + 1, and starting from position (P * P) / 2 we set every P-th bit to 1. This marks every multiple of P from P * P upward as composite.

We do this marking only until P * P becomes greater than or equal to our limit end (we're finding all primes < end). This bound guarantees that once it is reached, ALL zero bits in the array signify prime numbers.

The second version adds only one optimization to this Simple version: it makes it multi-core (multi-threaded). But this single optimization makes the code much bigger and more complex. Basically it slices the whole range of bits across all cores, so that they write bits to memory in parallel.
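
The multi-threaded version itself is only in the linked source, so here is just a rough illustration of the slicing idea (my own sketch, without the odd-only bit packing; SieveParallel, num_threads and the plain byte array are my own simplifications, not the answer's code):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// flags[i] != 0 means i is composite (or i < 2); flags[i] == 0 means i is prime.
std::vector<char> SieveParallel(std::size_t end, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<char> composite(end < 2 ? 2 : end, 0);
    composite[0] = composite[1] = 1;

    // Step 1: serially sieve the primes up to sqrt(end).
    std::size_t root = static_cast<std::size_t>(std::sqrt(static_cast<double>(end)));
    std::vector<std::size_t> small_primes;
    for (std::size_t p = 2; p <= root; ++p) {
        if (composite[p]) continue;
        small_primes.push_back(p);
        for (std::size_t j = p * p; j <= root; j += p)
            composite[j] = 1;
    }

    // Step 2: split the remaining range [root + 1, end) into one contiguous
    // slice per thread; each thread crosses out multiples of the small primes
    // inside its own slice. Slices are disjoint, so no two threads ever write
    // to the same element (hence vector<char> rather than vector<bool>).
    std::size_t lo = root + 1, hi = composite.size();
    std::size_t per_thread = hi > lo ? (hi - lo + num_threads - 1) / num_threads : 0;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads && per_thread > 0; ++t) {
        std::size_t begin = lo + t * per_thread;
        std::size_t finish = std::min(begin + per_thread, hi);
        if (begin >= finish) break;
        workers.emplace_back([&, begin, finish] {
            for (std::size_t p : small_primes) {
                std::size_t first = ((begin + p - 1) / p) * p; // first multiple of p >= begin
                for (std::size_t j = first; j < finish; j += p)
                    composite[j] = 1;
            }
        });
    }
    for (auto& w : workers) w.join();
    return composite;
}

(num_threads would typically be std::thread::hardware_concurrency(); the odd-only bit packing from the Simple version could be layered on top in the same way.)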

I'll explain only my third, Advanced version; it is the most complex of the 3. Besides the multi-threading optimization it also adds the so-called Primorial optimization.

What is a Primorial? It is the product of the first few smallest primes; for example, I take the primorial 2 * 3 * 5 * 7 = 210.

Any primorial splits the infinite range of integers into wheels by the modulus of this primorial. For example, primorial 210 splits it into the ranges [0; 210), [210; 2*210), [2*210; 3*210), etc.

Now it is easy to prove mathematically that inside all primorial ranges we can mark the same positions as composite: specifically, we can mark every number that is a multiple of 2, 3, 5, or 7 as composite.

Out of the 210 remainders, 162 are certainly composite, and only 48 remainders are possibly prime.

Hence it is enough for us to check primality for only 48/210 ≈ 22.8% of the whole search space. This reduction of the search space makes the task more than 4x faster and 4x less memory consuming.
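
The 48-out-of-210 figure is easy to verify with a few lines (a standalone check I added, not part of the answer's source):

#include <iostream>
#include <numeric>  // std::gcd (C++17)
#include <vector>

int main() {
    const int primorial = 2 * 3 * 5 * 7;    // 210
    std::vector<int> wheel;                 // remainders that can still contain primes > 7
    for (int r = 0; r < primorial; ++r)
        if (std::gcd(r, primorial) == 1)    // r shares no factor with 2, 3, 5, or 7
            wheel.push_back(r);
    std::cout << wheel.size() << " of " << primorial
              << " remainders survive\n";   // prints: 48 of 210 remainders survive
}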

One can see that my first Simple version, due to its odd-only optimization, was in fact already using a primorial-2 optimization: if we take primorial 2 instead of primorial 210, we get exactly the first (Simple) version of the algorithm.

All 3 of my versions are tested for correctness and speed, although some tiny bugs may still remain. Note: it is recommended not to use my code in production straight away unless it has been tested thoroughly.

All 3 versions are tested for correctness by cross-checking each other's answers. I test correctness thoroughly by feeding every limit (end value) from 0 to 2^18; this takes some time.

See the main() function to figure out how to use my functions.
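
For the Simple version shown above, a minimal call might look like this (my own usage sketch; the answer's actual main() is only in the linked source):

#include <iostream>

int main() {
    // Find all primes below 10,000,000, as in the question.
    auto primes = GenPrimes_SieveOfEratosthenes<std::uint32_t>(10000000);
    std::cout << primes.size() << " primes found, largest is "
              << primes.back() << std::endl;   // 664579 primes below 10 million
}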

Try it online!

SOURCE CODE GOES HERE. Due to the StackOverflow limit of 30K symbols per post, I can't inline the source code here, as it is almost 30K in size and together with the post above it would exceed 30K. So I'm providing the source code via a separate GitHub Gist, linked below. Note that the Try it online! link above also contains the full source code, but there I reduced the search limit from 2^32 to a smaller one because of GodBolt's 3-second run-time limit.

Github Gist code

Output:

10M time 'Simple' 0.024 sec
Time 2^32 'Simple' 46.924 sec, number of primes 203280221
Time 2^32 'Intermediate' 30.999 sec
Time 2^32 'Advanced' 11.359 sec
All checked till 0
All checked till 5000
All checked till 10000
All checked till 15000
All checked till 20000
All checked till 25000
