Take random samples from the data with a different sample size each time

Question

I have a pandas DataFrame that I want to draw random samples from. The first time I want to pick 10 rows, then 20, 30, 40, and 50 (without replacement). I'm trying to do it with a for loop, although I don't know how good this is, because a list can't contain DataFrames, right? (I'm more comfortable coding in R, where lists can contain data frames.)

number = [10,20,30,40,50]
sample = []
for i in range(len(number)):
    sample[i].append(data.sample(n = number[i]))

And the error is IndexError: list index out of range

I don't want to copy and paste the code, so what is the right way to do it?

Answer 1

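You could do that using the randint method to pick a random element from the list number:

import random
number = [10, 20, 30, 40, 50]
sample = []
for i in range(len(number)):
    # the sample size is a randomly chosen entry of number on each pass,
    # so a given size may appear more than once or not at all
    sample.append(data.sample(n=number[random.randint(0, len(number) - 1)]))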
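If the goal is one sample of each size (10, then 20, and so on), as the question describes, a minimal sketch could loop over the sizes directly. The DataFrame `data` below is a hypothetical stand-in with made-up columns, since the asker's data is not shown; the names `sizes`, `samples`, and `samples_by_size` are also just for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `data`: 100 rows of made-up values.
data = pd.DataFrame({"id": range(100), "value": [i * 0.5 for i in range(100)]})

sizes = [10, 20, 30, 40, 50]

# A Python list can hold DataFrames; each element is one independent sample.
# DataFrame.sample draws without replacement by default (replace=False).
samples = [data.sample(n=n) for n in sizes]

# Optionally key the samples by their size for easier lookup.
samples_by_size = {n: data.sample(n=n) for n in sizes}

for s in samples:
    print(len(s))
```

Each sample is drawn without replacement within itself, but the samples are independent of each other, so the same row can show up in more than one of them.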

Update:

Assuming you have this DataFrame for a movie ratings dataset:

import pandas as pd

data = [['avengers', 5.4, 'PG-13'],
['captain america', 6.7, 'PG-13'],
['spiderman', 7, 'R'],
['daredevil', 8.2, 'R'],
['iron man', 8.6, 'PG-13'],
['deadpool', 10, 'R']]

df = pd.DataFrame(data, columns=['title', 'score', 'rating'])

You can take random samples from it using the sample method:

# take 3 random records from the DataFrame
samples = df.sample(3)

Output:

             title  score rating
1  captain america    6.7  PG-13
5         deadpool   10.0      R
3        daredevil    8.2      R

Another execution:

       title  score rating
4   iron man    8.6  PG-13
0   avengers    5.4  PG-13
2  spiderman    7.0      R

You can also randomize the number of samples, up to the number of rows in your DataFrame:

df.sample(random.randint(1, len(df)))
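The two executions above return different rows because each call draws afresh. If repeatable results are wanted, one option is the `random_state` parameter of `DataFrame.sample`. The sketch below rebuilds the same movie DataFrame so it runs on its own; the seed values and the names `first`, `second`, `rng`, and `variable` are arbitrary choices for illustration.

```python
import random

import pandas as pd

df = pd.DataFrame(
    [
        ["avengers", 5.4, "PG-13"],
        ["captain america", 6.7, "PG-13"],
        ["spiderman", 7, "R"],
        ["daredevil", 8.2, "R"],
        ["iron man", 8.6, "PG-13"],
        ["deadpool", 10, "R"],
    ],
    columns=["title", "score", "rating"],
)

# The same seed gives the same rows on every call.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
print(first.equals(second))

# A seeded generator also makes the randomly chosen sample size repeatable.
rng = random.Random(0)
n = rng.randint(1, len(df))  # randint includes both endpoints
variable = df.sample(n=n, random_state=0)
print(len(variable))
```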

Alternate Approach:

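If you want, you could write your own function that returns a random contiguous slice of the DataFrame (the rows stay adjacent and in their original order, unlike with sample):

import random

def generate_rand_sample(df):
    # redraw the start and end positions until the slice is non-empty
    start_i = end_i = 0
    while end_i == start_i:
        start_i = random.randint(0, len(df) - 1)
        end_i = random.randint(start_i, len(df))
    # rows start_i .. end_i - 1
    return df.iloc[start_i:end_i]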

generate_rand_sample(df)

First Run:

             title  score rating
1  captain america    6.7  PG-13
2        spiderman    7.0      R

Second Run:

      title  score rating
2  spiderman    7.0      R
3  daredevil    8.2      R
4   iron man    8.6  PG-13
5   deadpool   10.0      R

Answer 2

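The loop bounds are not the problem: range(len(number)) runs from 0 to 4, which is 5 iterations (0, 1, 2, 3, 4), one per entry in number. The IndexError comes from sample[i]: sample starts out as an empty list, so there is no element at index 0 to call append on. Call sample.append(data.sample(n = number[i])) instead. A Python list can hold DataFrames without any problem.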
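To check where the IndexError in the question actually comes from, a small sketch can separate the loop indices from the list indexing. It uses plain integers in place of DataFrames so that it runs without any data; the names `indices` and `raised` are just for illustration.

```python
number = [10, 20, 30, 40, 50]

# range(len(number)) yields exactly one index per entry.
indices = list(range(len(number)))
print(indices)  # [0, 1, 2, 3, 4]

sample = []
try:
    # Same pattern as in the question: index into an empty list, then append.
    sample[0].append("anything")
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)

# Appending to the list itself works; the sizes stand in for the samples here.
for n in number:
    sample.append(n)
print(sample)
```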

solution1
0 2022-11-27 15:55:22

solution2
0 2022-11-27 15:56:41
