Why do digits 1, 2 and 3 appear so frequently using C rand() function?

What I am trying to do is to generate some random numbers (not necessarily single digit) like

29106
7438
5646
4487
9374
28671
92
13941
25226
10076

and then count the number of digits I get:

count[0] =       3  Percentage =  6.82
count[1] =       5  Percentage = 11.36
count[2] =       6  Percentage = 13.64
count[3] =       3  Percentage =  6.82
count[4] =       6  Percentage = 13.64
count[5] =       2  Percentage =  4.55
count[6] =       7  Percentage = 15.91
count[7] =       5  Percentage = 11.36
count[8] =       3  Percentage =  6.82
count[9] =       4  Percentage =  9.09

This is the code I am using:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main() {

    int i;
    srand(time(NULL));
    FILE* fp = fopen("random.txt", "w");    
    // for(i = 0; i < 10; i++)
    for(i = 0; i < 1000000; i++)
        fprintf(fp, "%d\n", rand());
    fclose(fp);

    int dummy;
    long count[10] = {0,0,0,0,0,0,0,0,0,0};
    fp = fopen("random.txt", "r");
    // Loop on the fscanf result: testing feof() first would count the last digit twice.
    while(fscanf(fp, "%1d", &dummy) == 1) {
        count[dummy]++;
    }
    fclose(fp);

    long sum = 0;
    for(i = 0; i < 10; i++)
        sum += count[i];

    for(i = 0; i < 10; i++)
        printf("count[%d] = %7ld  Percentage = %5.2f\n",
            i, count[i], ((float)(100 * count[i])/sum));

}

If I generate a large number of random numbers (1000000), this is the result I get:

count[0] =  387432  Percentage =  8.31
count[1] =  728339  Percentage = 15.63
count[2] =  720880  Percentage = 15.47
count[3] =  475982  Percentage = 10.21
count[4] =  392678  Percentage =  8.43
count[5] =  392683  Percentage =  8.43
count[6] =  392456  Percentage =  8.42
count[7] =  391599  Percentage =  8.40
count[8] =  388795  Percentage =  8.34
count[9] =  389501  Percentage =  8.36

Notice that 1, 2 and 3 have too many hits. I have tried running this several times and each time I get very similar results.

I am trying to understand what could cause 1, 2 and 3 to appear much more frequently than any other digit.


Taking a hint from what Matt Joiner and Pascal Cuoq pointed out,

I changed the code to use

for(i = 0; i < 1000000; i++)
    fprintf(fp, "%04d\n", rand() % 10000);
// pretty prints 0
// generates numbers in range 0000 to 9999

and this is what I get (similar results on multiple runs):

count[0] =  422947  Percentage = 10.57
count[1] =  423222  Percentage = 10.58
count[2] =  414699  Percentage = 10.37
count[3] =  391604  Percentage =  9.79
count[4] =  392640  Percentage =  9.82
count[5] =  392928  Percentage =  9.82
count[6] =  392737  Percentage =  9.82
count[7] =  392634  Percentage =  9.82
count[8] =  388238  Percentage =  9.71
count[9] =  388352  Percentage =  9.71

What can be the reason that 0, 1 and 2 are favored?


Thanks everyone. Using

int rand2(){
    int num = rand();
    return (num > 30000? rand2():num);     
}

    fprintf(fp, "%04d\n", rand2() % 10000);

I get

count[0] =  399629  Percentage =  9.99
count[1] =  399897  Percentage = 10.00
count[2] =  400162  Percentage = 10.00
count[3] =  400412  Percentage = 10.01
count[4] =  399863  Percentage = 10.00
count[5] =  400756  Percentage = 10.02
count[6] =  399980  Percentage = 10.00
count[7] =  400055  Percentage = 10.00
count[8] =  399143  Percentage =  9.98
count[9] =  400104  Percentage = 10.00

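rand() generates a value from 0 to RAND_MAX. RAND_MAX is implementation-defined and is guaranteed to be at least 32767; typical values are 32767 (for example with Microsoft's C runtime) and 2147483647 (for example with glibc).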

For your example given above, it appears that RAND_MAX is 32767. The values from 10000 to 32767 make up most of that range, and every one of them starts with 1, 2 or 3, so those digits appear unusually often as the most significant digit. To a lesser degree, digits up to 6 and 7 are also slightly favored.
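The effect can be checked without any randomness by printing every value a perfectly uniform generator could return exactly once and counting the digits. The following is a minimal sketch under that assumption; the function names are made up for illustration and are not part of the original code.

```c
#include <stdio.h>

/* Count how often each decimal digit appears when every integer in
   [0, max_value] is written once without leading zeros (as "%d" does).
   counts[] must have room for 10 entries. Returns the total digit count. */
long count_digits_in_range(long max_value, long counts[10])
{
    long total = 0;
    for (int d = 0; d < 10; d++)
        counts[d] = 0;

    for (long n = 0; n <= max_value; n++) {
        long v = n;
        do {                      /* do/while so that 0 contributes one '0' */
            counts[v % 10]++;
            total++;
            v /= 10;
        } while (v > 0);
    }
    return total;
}

/* Print the distribution in the same format as the question's output. */
void print_digit_distribution(long max_value)
{
    long counts[10];
    long total = count_digits_in_range(max_value, counts);
    for (int d = 0; d < 10; d++)
        printf("count[%d] = %7ld  Percentage = %5.2f\n",
               d, counts[d], 100.0 * counts[d] / total);
}
```

Calling print_digit_distribution(32767) gives the distribution expected from an ideal generator with RAND_MAX == 32767, which can be compared directly against the measured percentages in the question.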

Regarding the edited question,

This is because the digits are still not uniformly distributed even if you take rand() % 10000. Assume RAND_MAX == 32767, and that rand() is perfectly uniform.

For every 10,000 numbers counting from 0, all of the digits will appear uniformly (4,000 each). However, rand() can return 32,768 different values (0 to 32,767), and 32,768 is not divisible by 10,000. The 2,768 values left over (30,000 to 32,767) map to 0000 through 2767, so they add more leading 0s, 1s and 2s to the final count.

The exact contribution from these 2,768 numbers is:

digit  count
0      1857
1      1857
2      1625
3      857
4      857
5      857
6      855
7      815
8      746
9      746

Adding 12,000 to each count (the contribution of the first 30,000 numbers) and then dividing by the total number of digits (4×32,768) gives the expected distribution:

digit   probability (%)
0       10.5721
1       10.5721
2       10.3951
3        9.80911
4        9.80911
5        9.80911
6        9.80759
7        9.77707
8        9.72443
9        9.72443

which is close to what you get.
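The arithmetic above can be reproduced by brute force. This is a small sketch, assuming a perfectly uniform generator whose largest value is passed in as rand_max; count_padded_digits is a name invented for this example.

```c
/* For an ideal generator that returns each value in [0, rand_max] equally
   often, count how often each digit appears when value % 10000 is printed
   as four zero-padded digits ("%04d"). counts[] must hold 10 entries.
   Returns the total number of digits counted. */
long count_padded_digits(long rand_max, long counts[10])
{
    long total = 0;
    for (int d = 0; d < 10; d++)
        counts[d] = 0;

    for (long n = 0; n <= rand_max; n++) {
        long v = n % 10000;
        for (int pos = 0; pos < 4; pos++) {   /* always four digits */
            counts[v % 10]++;
            v /= 10;
            total++;
        }
    }
    return total;
}
```

With rand_max set to 32767, counts[0] comes out as 12,000 + 1,857 = 13,857 out of 131,072 digits, which is the 10.5721% listed for digit 0. With rand_max set to 29999, every digit is counted exactly 12,000 times.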

If you want a truly uniform digit distribution, you need to reject those 2,768 numbers:

int rand_4digits() {
  const int RAND_MAX_4_DIGITS = RAND_MAX - RAND_MAX % 10000;
  int res;
  do {
    res = rand();
  } while (res >= RAND_MAX_4_DIGITS);
  return res % 10000;
}
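The same rejection idea works for any upper bound, not only 10000. Below is a hedged sketch of a general helper; rand_below is a hypothetical name, and the code assumes that rand() itself is uniform over [0, RAND_MAX] and that 1 <= n <= RAND_MAX.

```c
#include <stdlib.h>

/* Hypothetical helper: returns an integer in [0, n), each value equally
   likely, provided rand() is uniform and 1 <= n <= RAND_MAX. */
int rand_below(int n)
{
    /* rand() can return RAND_MAX + 1 distinct values. The sum is computed
       in unsigned arithmetic because RAND_MAX may equal INT_MAX. */
    unsigned int range = (unsigned int)RAND_MAX + 1u;

    /* Largest multiple of n that does not exceed range. Values at or above
       this limit would over-represent the small remainders. */
    unsigned int limit = range - range % (unsigned int)n;

    unsigned int r;
    do {
        r = (unsigned int)rand();
    } while (r >= limit);         /* retry until r falls in the even part */

    return (int)(r % (unsigned int)n);
}
```

For example, rand_below(10000) can be printed with "%04d" in place of rand() % 10000. Fewer than half of the draws are rejected, so on average the loop runs less than twice per call.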

This looks like Benford's law (see http://en.wikipedia.org/wiki/Benford%27s_law), or else an RNG that isn't very good.

That's because you generate numbers between 0 and RAND_MAX. The generated numbers are evenly distributed (i.e. each number has approximately the same probability); however, the digits 1, 2 and 3 occur more often than the others among the numbers in this range. Try generating numbers between 0 and 9 instead, where each digit occurs with the same probability, and you'll get a nice distribution.

If I understand what the OP (person asking the question) wants, they want to make better random numbers.

rand() and random(), quite frankly, don't make very good random numbers; they both do poorly when tested against diehard and dieharder (two packages for testing the quality of random numbers).

The Mersenne twister is a popular random number generator which is good for pretty much everything except crypto-strong random numbers; it passes all of the diehard(er) tests with flying colors.

If one needs crypto-strong random numbers (numbers that can not be guessed, even if someone knows which particular crypto-strong algorithm is being used), there are a number of stream ciphers out there. The one I like to use is called RadioGatún[32], and here's a compact C representation of it:

/*Placed in the public domain by Sam Trenholme*/
#include <stdint.h>
#include <stdio.h> 
#define p uint32_t
#define f(a) for(c=0;c<a;c++)
#define n f(3){b[c*13]^=s[c];a[16+c]^=s[c];}k(a,b 
k(p *a,p *b){p A[19],x,y,r,q[3],c,i;f(3){q[c]=b[c
*13+12];}for(i=12;i;i--){f(3){b[c*13+i]=b[c*13+i- 
1];}}f(3){b[c*13]=q[c];}f(12){i=c+1+((c%3)*13);b[
i]^=a[c+1];}f(19){y=(c*7)%19;r=((c*c+c)/2)%32;x=a
[y]^(a[(y+1)%19]|(~a[(y+2)%19]));A[c]=(x>>r)|(x<<
(32-r));}f(19){a[c]=A[c]^A[(c+1)%19]^A[(c+4)%19];
}a[0]^=1;f(3){a[c+13]^=q[c];}}l(p *a,p *b,char *v
){p s[3],q,c,r,x,d=0;for(;;){f(3){s[c]=0;}for(r=0
;r<3;r++){for(q=0;q<4;q++){if(!(x=*v&255)){d=x=1;
}v++;s[r]|=x<<(q*8);if(d){n);return;}}}n);}}main(
int j,char **h){p a[39],b[39],c,e,g;if(j==2){f(39
){a[c]=b[c]=0;}l(a,b,h[1]);f(16){k(a,b);}f(4){k(a
,b);for(j=1;j<3;++j){g=a[j];for(e=4;e;e--){printf
("%02x",g&255);g>>=8;}}}printf("\n");}}

There are also a lot of other really good random number generators out there.

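When you want to generate a random value from the range [0, x), instead of doing rand()%x, you can apply the formula x*((double)rand()/(RAND_MAX + 1.0)) and truncate the result to an integer. Dividing by RAND_MAX + 1.0 rather than RAND_MAX keeps the result strictly below x. This spreads the surplus values across the whole range instead of piling them up on the smallest results, although it cannot make the outputs exactly equally likely unless RAND_MAX + 1 is a multiple of x.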

Say RAND_MAX is equal to 15, so rand() gives you integers from 0 to 15. When you use the modulo operator to get random numbers in [0, 10), the values in [0, 5] will occur more often than those in [6, 9], because, for example, 3 == 3 % 10 == 13 % 10.
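The toy case can be tabulated exhaustively. The sketch below assumes a hypothetical generator that returns each value in [0, toy_max] exactly once, and compares the modulo reduction with the scaling approach. The scaling version divides by toy_max + 1.0 so that the result always stays below x; both function names are invented for this illustration.

```c
/* Count how often each result in [0, x) is produced when every value
   0..toy_max of an ideal generator is reduced with r % x.
   counts[] must have room for x entries. */
void count_modulo(int toy_max, int x, int counts[])
{
    for (int i = 0; i < x; i++)
        counts[i] = 0;
    for (int r = 0; r <= toy_max; r++)
        counts[r % x]++;
}

/* Same tabulation, but reducing by scaling: (int)(x * (r / (toy_max + 1.0))). */
void count_scaled(int toy_max, int x, int counts[])
{
    for (int i = 0; i < x; i++)
        counts[i] = 0;
    for (int r = 0; r <= toy_max; r++)
        counts[(int)(x * (r / (toy_max + 1.0)))]++;
}
```

With toy_max = 15 and x = 10, the modulo version produces 0 through 5 twice each and 6 through 9 once each. The scaled version still has six results that occur twice and four that occur once, because 16 inputs cannot be split evenly over 10 outputs, but the doubled results are spread over the range rather than clustered at the low end. Only rejecting the surplus inputs removes the imbalance entirely.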
