What am I doing wrong with these random numbers?

Question

I've been told that rand() mod n produces biased results, so I wrote this code to check it. It generates s numbers from 1 to l and then sorts them by number of occurrences.

#include <cstdlib>   // rand, srand
#include <ctime>     // time
#include <iostream>
#include <random>
#include <utility>   // swap

using namespace std;

struct vec_struct{
    int num;
    int count;
    double ratio;
};

void num_sort(vec_struct v[], int n){
    for (int i = 0; i < n-1; i++){
        for (int k = 0; k < n-1-i; k++){
            if (v[k].num > v[k+1].num) swap(v[k], v[k+1]);
        }
    }
}

void count_sort(vec_struct v[], int n){
    for (int i = 0; i < n-1; i++){
        for (int k = 0; k < n-1-i; k++){
            if (v[k].count < v[k+1].count) swap(v[k], v[k+1]);
        }
    }
}

int main(){

    srand(time(0));

    random_device rnd;

    int s, l, b, c = 1;

    cout << "How many numbers to generate? ";
    cin >> s;

    cout << "Generate " << s << " numbers ranging from 1 to? ";
    cin >> l;

    cout << "Use rand or mt19937? [1/2] ";
    cin >> b;

    vec_struct * vec = new vec_struct[s];

    mt19937 engine(rnd());
    uniform_int_distribution <int> dist(1, l);

    if (b == 1){
        for (int i = 0; i < s; i++){
            vec[i].num = (rand() % l) + 1;
        }
    } else if (b == 2){
        for (int i = 0; i < s; i++){
            vec[i].num = dist(engine);
        }   
    }
    num_sort(vec, s);

    for (int i = 0, j = 0; i < s; i++){
        if (i + 1 < s && vec[i].num == vec[i+1].num){  // do not read past the last element
            c++;
        } else {
            vec[j].num = vec[i].num;
            vec[j].count = c;
            vec[j].ratio = ((double)c/s)*100;
            j++;
            c = 1;  
        }
    }
    count_sort(vec, l);

    if (l >= 20){

        cout << endl << "Showing the 10 most common numbers" << endl;
        for (int i = 0; i < 10; i++){
            cout << vec[i].num << "\t" << vec[i].count << "\t" << vec[i].ratio << "%" << endl;
        }

        cout << endl << "Showing the 10 least common numbers" << endl;
        for (int i = l-10; i < l; i++){
            cout << vec[i].num << "\t" << vec[i].count << "\t" << vec[i].ratio << "%" << endl;
        }
    } else {

        for (int i = 0; i < l; i++){
            cout << vec[i].num << "\t" << vec[i].count << "\t" << vec[i].ratio << "%" << endl;
        }
    }
}

After running this code I can spot the expected bias from rand():

$ ./rnd_test 
How many numbers to generate? 10000
Generate 10000 numbers ranging from 1 to? 50
Use rand or mt19937? [1/2] 1

Showing the 10 most common numbers
17  230 2.3%
32  227 2.27%
26  225 2.25%
25  222 2.22%
3   221 2.21%
10  220 2.2%
35  218 2.18%
5   217 2.17%
13  215 2.15%
12  213 2.13%

Showing the 10 least common numbers
40  187 1.87%
7   186 1.86%
39  185 1.85%
42  184 1.84%
43  184 1.84%
34  182 1.82%
21  175 1.75%
22  175 1.75%
18  173 1.73%
44  164 1.64%

However, I'm getting pretty much the same result with mt19937 and uniform_int_distribution! What's wrong here? Shouldn't it be uniform, or is the test useless?

Answer 1

No, it should not be perfectly uniform. Thus the above is not evidence of any error.

The numbers are random, so the counts should be fairly uniform, but not exactly equal.

In particular, you would expect each number to occur about 10000/50 = 200 times, with a standard deviation of roughly sqrt(200), which is about 14. With 50 numbers you would expect the extremes to differ from the mean by about 2 standard deviations, which is ±28.

The bias caused by taking rand() modulo n (its size depends on RAND_MAX) is smaller than that, so you would need a lot more samples to detect it.
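To put rough numbers on this comparison, here is a minimal sketch that computes the exact size of the modulo bias and the relative sampling noise of a single count. The names `ModuloBias`, `modulo_bias` and `relative_noise` are made up for illustration, and it assumes the raw generator is perfectly uniform on [0, max_value].

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Effect of reducing a uniform integer in [0, max_value] with % n.
struct ModuloBias {
    std::uint64_t low_count;   // raw values mapping to each under-represented residue
    std::uint64_t num_high;    // number of residues that receive one extra raw value
    double relative_excess;    // how much too likely an over-represented residue is,
                               // relative to the ideal probability 1/n
};

ModuloBias modulo_bias(std::uint64_t max_value, std::uint64_t n) {
    assert(n > 0 && n <= max_value);
    const std::uint64_t total = max_value + 1;  // number of possible raw values
    const std::uint64_t rem = total % n;        // residues 0..rem-1 get one extra value
    ModuloBias b;
    b.low_count = total / n;
    b.num_high = rem;
    // p_high * n - 1 = ((low_count + 1) * n - total) / total = (n - rem) / total
    b.relative_excess =
        rem == 0 ? 0.0 : static_cast<double>(n - rem) / static_cast<double>(total);
    return b;
}

// Relative standard deviation of one count when drawing `samples` values
// uniformly from n outcomes (binomial with p = 1/n).
double relative_noise(std::uint64_t samples, std::uint64_t n) {
    assert(samples > 0 && n > 0);
    const double p = 1.0 / static_cast<double>(n);
    return std::sqrt((1.0 - p) / (static_cast<double>(samples) * p));
}
```

Assuming RAND_MAX is 32767 (the smallest value the standard allows), `modulo_bias(32767, 50).relative_excess` is 32/32768, roughly 0.1%, while `relative_noise(10000, 50)` is about 7%. With a larger RAND_MAX the bias is smaller still.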

Answer 2

You have to use more samples for such random number tests. I tried 50000 with your code, and the result is:

How many numbers to generate? 50000
Generate 50000 numbers ranging from 1 to? 50
Use rand or mt19937? [1/2] 2

Showing the 10 most common numbers
36  1054  2.108%
14  1051  2.102%
11  1048  2.096%
27  1045  2.09%
2   1044  2.088%
33  1035  2.07%
21  1034  2.068%
48  1034  2.068%
34  1030  2.06%
39  1030  2.06%

Showing the 10 least common numbers
47  966  1.932%
16  961  1.922%
38  960  1.92%
28  959  1.918%
8   958  1.916%
10  958  1.916%
30  958  1.916%
32  958  1.916%
18  953  1.906%
23  953  1.906%
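The effect of the sample size can be checked directly. The following is only a sketch: `relative_spread` is a hypothetical helper, and the fixed seed is an assumption made so that runs are repeatable. It measures the gap between the most and least common value as a fraction of the expected count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Draw `samples` values from 1..n with mt19937 + uniform_int_distribution and
// return (largest count - smallest count) / expected count.
double relative_spread(std::uint32_t seed, int samples, int n) {
    assert(samples > 0 && n > 0);
    std::mt19937 engine(seed);
    std::uniform_int_distribution<int> dist(1, n);

    std::vector<int> counts(n, 0);
    for (int i = 0; i < samples; ++i) {
        ++counts[dist(engine) - 1];  // value v is tallied in slot v - 1
    }

    const auto mm = std::minmax_element(counts.begin(), counts.end());
    const double expected = static_cast<double>(samples) / n;
    return (*mm.second - *mm.first) / expected;
}
```

For comparison, the question's run of 10000 numbers has a spread of (230 - 164)/200 = 0.33, while the run of 50000 above has (1054 - 953)/1000, about 0.10. Calling `relative_spread` with a larger `samples` should shrink the value further, roughly in proportion to 1/sqrt(samples).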

Answer 3

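As far as I can tell from http://www.cplusplus.com/reference/random/mersenne_twister_engine/, mt19937 will suffer from the same bias as rand() if its raw output is reduced with the modulus operator. (The code in the question feeds it through uniform_int_distribution, which is specified to produce uniformly distributed integers.)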

The bias is due to rand() generating an integer in the range [0, RAND_MAX]. When you take the modulus, smaller results become slightly more likely (unless your divisor evenly divides RAND_MAX + 1).

Consider:

Range [0-74]:
0 % 50 = 0
40 % 50 = 40
50 % 50 = 0
74 % 50 = 24
(results less than 25 occur twice, results from 25 to 49 only once)
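The toy range above can be enumerated by brute force. This is a small sketch; `residue_counts` is a name invented for illustration, not a library function.

```cpp
#include <cassert>
#include <vector>

// Count how many raw values in [0, max_value] map to each result of raw % n.
std::vector<int> residue_counts(int max_value, int n) {
    assert(max_value >= 0 && n > 0);
    std::vector<int> counts(n, 0);
    for (int raw = 0; raw <= max_value; ++raw) {
        ++counts[raw % n];
    }
    return counts;
}
```

`residue_counts(74, 50)` holds 2 in slots 0 to 24 and 1 in slots 25 to 49, which matches the example. With `residue_counts(99, 50)` every slot holds 2, because 50 divides the 100 possible raw values evenly.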


3 answers

solution1
1 2016-11-08 08:34:22

solution2
0 2016-11-08 10:29:43

solution3
-1 2016-11-08 08:35:23
