
Take random samples from the data with a different number each time

I have a pandas dataframe that I want to randomly pick samples from. The first time I want to pick 10, then 20, 30, 40, and 50 random samples (without replacement). I'm trying to do it with a for loop, although I don't know how good this is, because a list can't contain dataframes, right? (My coding is better in R, and there lists can contain dataframes.)

number = [10,20,30,40,50]
sample = []
for i in range(len(number)):
    sample[i].append(data.sample(n = number[i]))

And the error is IndexError: list index out of range

I don't want to copy and paste the code, so what is the right way to do it?

You could do that using the randint method to choose a random element from the list number:

import random
number = [10,20,30,40,50]
sample = []
for i in range(len(number)):
    # pick one of the sizes at random and draw that many rows
    sample.append(data.sample(n = number[random.randint(0, len(number) - 1)]))
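If the sizes should instead be used in order (10, then 20, and so on), as the question describes, a minimal sketch could look like the following. A Python list can hold dataframes, so no workaround is needed. The `data` frame here is a hypothetical 100-row stand-in for the asker's dataframe, and `samples_by_size` is just an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the question's `data`: 100 rows, one column.
data = pd.DataFrame({"value": range(100)})

sizes = [10, 20, 30, 40, 50]

# One independent sample per size; each is drawn without replacement,
# which is the default behaviour of DataFrame.sample.
samples = [data.sample(n=n) for n in sizes]

# Keying the samples by size makes each one easy to look up later.
samples_by_size = {n: data.sample(n=n) for n in sizes}

print([len(s) for s in samples])  # [10, 20, 30, 40, 50]
```

Each sample is drawn separately from the full dataframe, so the same row may appear in more than one sample, but never twice within a single sample.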

Update:

Assuming you have this dataframe for Movies Rating dataset:

import pandas as pd

data = [['avengers', 5.4 ,'PG-13'],
['captain america', 6.7, 'PG-13'],
['spiderman', 7,    'R'],
['daredevil', 8.2, 'R'],
['iron man', 8.6, 'PG-13'],
['deadpool', 10, 'R']]

df = pd.DataFrame(data, columns=['title', 'score', 'rating'])

You can take random samples from it using the sample method:

# taking random 3 records from dataframe
samples = df.sample(3)

Output:

             title  score rating
1  captain america    6.7  PG-13
5         deadpool   10.0      R
3        daredevil    8.2      R

Another execution:

       title  score rating
4   iron man    8.6  PG-13
0   avengers    5.4  PG-13
2  spiderman    7.0      R
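Since the question asks for sampling without replacement, it may help to see two properties of sample in a small sketch: passing random_state makes a draw repeatable, and asking for more rows than the dataframe has raises a ValueError unless replace=True is set. The seed value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["avengers", 5.4, "PG-13"],
        ["captain america", 6.7, "PG-13"],
        ["spiderman", 7, "R"],
        ["daredevil", 8.2, "R"],
        ["iron man", 8.6, "PG-13"],
        ["deadpool", 10, "R"],
    ],
    columns=["title", "score", "rating"],
)

# Same seed, same rows: useful when a result has to be reproduced.
first = df.sample(3, random_state=0)
second = df.sample(3, random_state=0)
print(first.equals(second))  # True

# Without replacement (the default), n cannot exceed the number of rows.
try:
    df.sample(len(df) + 1)
except ValueError as exc:
    print("ValueError:", exc)

# With replace=True the larger sample is allowed, and rows may repeat.
oversized = df.sample(len(df) + 1, replace=True)
```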

You can also randomize the number of samples, up to the number of rows in your dataframe:

df.sample(random.randint(1, len(df)))

Alternate Approach:

If you want, you could write your own function that generates random samples from a dataframe like this. Note that it returns a random contiguous block of rows rather than independently chosen rows:

import random   
def generate_rand_sample(df):
    start_i = end_i = 0
    while end_i == start_i:
        start_i = random.randint(0, len(df) - 1)
        end_i = random.randint(start_i, len(df))
    return df.iloc[start_i:end_i]

generate_rand_sample(df)

First Run:

             title  score rating
1  captain america    6.7  PG-13
2        spiderman    7.0      R

Second Run:

      title  score rating
2  spiderman    7.0      R
3  daredevil    8.2      R
4   iron man    8.6  PG-13
5   deadpool   10.0      R

The loop range is not the problem: range(len(number)) already runs from 0 to 4, which is five iterations. The IndexError comes from sample[i], because sample is still an empty list and has no element at index i. Append to the list itself instead: sample.append(data.sample(n = number[i])).
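A quick check, independent of pandas, of which indices the loop in the question visits and which expression raises the error. The names `indices` and `error_message` are only there for illustration.

```python
number = [10, 20, 30, 40, 50]
sample = []

# range(len(number)) stops at 4, so the loop body runs five times.
indices = list(range(len(number)))
print(indices)  # [0, 1, 2, 3, 4]

# Indexing the empty list fails before .append is ever reached.
try:
    sample[0]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)  # IndexError: list index out of range
```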
