
np.random.rand() or random.random()

While analyzing some code, I stumbled upon the following snippet:

msk = np.random.rand(len(df)) < 0.8

The variables "msk" and "df" are irrelevant to my question. After some research, I think the same usage also applies to the "random" module. It gives True with an 80% chance and False with a 20% chance for each element, and it is used for masking. I understand why it is used, but I don't understand how it works. Isn't the random method supposed to return floats? Why do we get booleans when we compare its result against a threshold?

np.random.rand(len(df)) returns an array of uniform random numbers in the half-open interval [0, 1). The comparison np.random.rand(len(df)) < 0.8 is applied element by element and turns it into an array of booleans: True where the value is below 0.8, False otherwise.

As each value has an 80% chance of being below 0.8, about 80% of the values are True.
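
To make the float-to-boolean step visible, here is a minimal sketch. The names values, mask and single are made up for illustration; the standard-library line at the end assumes the question's title is asking whether random.random() behaves the same way for a single number.

```python
import random
import numpy as np

values = np.random.rand(10_000)   # floats drawn uniformly from [0, 1)
mask = values < 0.8               # elementwise comparison -> boolean array

print(values.dtype)  # float64
print(mask.dtype)    # bool
print(mask.mean())   # fraction of True values, close to 0.8 but not exact

# Standard-library counterpart: one float at a time, compared to the threshold.
single = random.random() < 0.8    # a plain Python bool
print(single)
```

So the random function still returns floats; it is the < operator that produces the booleans.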

A more explicit approach would be to use numpy.random.choice:

np.random.choice([True, False], p=[0.8, 0.2], size=len(df))

An even better approach, if your goal is to subset a dataframe, would be to use:

df.sample(frac=0.8)

How to split a dataframe 0.8/0.2:

df1 = df.sample(frac=0.8)
df2 = df.drop(df1.index)
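
For comparison, here is a rough sketch of both ways to split, run on a small made-up dataframe. The column name x and the names part_a and part_b are assumptions for illustration. The mask decides each row independently, so its split is only approximately 80/20, while sample(frac=0.8) picks a fixed number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(100)})  # hypothetical data for illustration

# Mask-based split: each row lands in the first part with probability 0.8,
# so the sizes are only approximately 80/20.
msk = np.random.rand(len(df)) < 0.8
part_a = df[msk]
part_b = df[~msk]   # ~ inverts the boolean mask

# sample-based split: 80 of the 100 rows, chosen without replacement.
df1 = df.sample(frac=0.8)
df2 = df.drop(df1.index)

print(len(part_a), len(part_b))  # varies from run to run
print(len(df1), len(df2))        # 80 20
```

Note that df.drop(df1.index) relies on the index labels being unique; with duplicate labels it would drop more rows than intended.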


 