
What exactly does this random.uniform line in Python do?

I'm following a tutorial here from Andrew Cross on using random forests in Python. I got the code to run fine, and for the most part I understand the output. However, I am unsure exactly what this line does:

df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

I know that it "creates a (random) uniform distribution between 0 and 1 and assigns 3/4ths of the data to be in the training subset." However, the training subset is not always exactly 3/4 of the data. Sometimes it is smaller and sometimes it is larger. So is a randomly sized subset chosen that is approximately 75%? Why not make it always 75%?

It does not assign exactly 3/4 of the data to the training subset.
It gives each row a 3/4 probability of being in the training subset:

Example:

>>> import numpy as np
>>> sum(np.random.uniform(0, 1, 10) < .75)
8
>>> sum(np.random.uniform(0, 1, 10) < .75)
10
>>> sum(np.random.uniform(0, 1, 10) < .75)
7
  • 80% of the data is in the training subset in the 1st example
  • 100% -- in the 2nd one
  • 70% -- in the 3rd.

On average, it should be 75%.
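The "on average" claim can be checked with a quick simulation. This is only a sketch: the seed, the number of trials, and the names n_rows, n_trials and train_fraction are my own choices, not something from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only so the run is repeatable

n_rows = 10         # same sample size as the draws above
n_trials = 100_000  # number of simulated splits (assumed value)

# Each row of `draws` is one simulated split of n_rows samples.
draws = rng.uniform(0, 1, size=(n_trials, n_rows))

# Fraction of samples that landed in the training subset, per split.
train_fraction = (draws < 0.75).mean(axis=1)

print(train_fraction.mean())                       # close to 0.75
print(train_fraction.min(), train_fraction.max())  # individual splits vary widely
```

Any single split can be well off 75%, especially with only 10 rows, but the mean over many splits settles near 0.75.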

np.random.uniform(0, 1, len(df)) creates an array of len(df) random numbers drawn uniformly between 0 and 1.
<= .75 then creates another array containing True where a number satisfies that condition and False everywhere else.
The code then uses the rows at the indexes where True was found as the training set. Since the random distribution is... well, random, you won't get exactly 75% of the values.
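Those steps can be laid out one at a time. This is a minimal sketch on a plain NumPy array: data is a hypothetical stand-in for the rows of df, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only so the run is repeatable

data = np.arange(8)                   # hypothetical stand-in for the rows of df
draws = rng.uniform(0, 1, len(data))  # one random number in [0, 1) per row
is_train = draws <= 0.75              # boolean array, True where the draw is <= 0.75

train = data[is_train]   # rows whose mask entry is True
test = data[~is_train]   # all remaining rows

print(is_train)
print(len(train), len(test))
```

Every row ends up in exactly one of the two subsets, but how many land in each depends on the draws.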

If you want to be stricter and randomly select a training set that is always very close to 75%, you can use code like this:

import numpy as np

d = np.random.uniform(0, 1, 1000)
p = np.percentile(d, 75)   # value below which 75% of the draws fall

print(np.sum(d <= p))   # 750
print(np.sum(d <= .75)) # varies from run to run, e.g. 745

In your example:

d = np.random.uniform(0, 1, len(df))
p = np.percentile(d, 75)
df['is_train'] = d <= p
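Another way to get a fixed-size split, not taken from the tutorial, is to shuffle the row positions and mark the first 75% of them as training rows. This is a sketch only: n_rows is a hypothetical stand-in for len(df), and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only so the run is repeatable

n_rows = 1000                        # hypothetical; stands in for len(df)
n_train = int(round(0.75 * n_rows))  # exact number of training rows wanted

# Start with every row marked as test, then flip n_train randomly chosen rows.
is_train = np.zeros(n_rows, dtype=bool)
is_train[rng.permutation(n_rows)[:n_train]] = True

print(is_train.sum())  # 750
```

The resulting array could then be assigned to df['is_train'] in the same way as above. When len(df) is not a multiple of 4, the split is only as close to 75% as rounding allows.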
