
How to randomly select a fixed number of rows per group (or all rows if the group is smaller) in pandas?

Example Dataframe:

    Name Group_Id
    AAA  1
    ABC  1
    BDF  1
    CCC  2
    XYZ  2
    DEF  3 

How can I randomly select a fixed number of rows for each Group_Id? This answer suggests the following method:

df.groupby('Group_Id').apply(lambda x: x.sample(2)).reset_index(drop=True)

But it raises an error if any group has fewer than 2 rows. In that case I want to select all rows of the group. .head() does that, but I want random samples, not the first rows.
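
To make the failure concrete, here is a minimal sketch that rebuilds the example frame and calls sample(2) on the one-row group directly. The variable names are only illustrative, and the exact error message may differ between pandas versions.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "Name": ["AAA", "ABC", "BDF", "CCC", "XYZ", "DEF"],
    "Group_Id": [1, 1, 1, 2, 2, 3],
})

# Group 3 contains a single row.
one_row_group = df[df["Group_Id"] == 3]

try:
    # Sampling 2 rows without replacement from 1 row is not possible.
    one_row_group.sample(2)
    raised = False
except ValueError as exc:
    raised = True
    print(f"sample(2) failed: {exc}")
```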

Say I want at most two random draws per Group_Id; I would get, for example:

    Name Group_Id
    AAA  1
    BDF  1
    CCC  2
    XYZ  2
    DEF  3

You can sample only when a group has more than n rows, and keep the group unchanged otherwise:

n = 2
(df.groupby('Group_Id')
   .apply(lambda x: x.sample(n) if len(x) > n else x)
   .reset_index(drop=True)
)
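
A self-contained sketch of the same conditional idea, written as an explicit loop over the groups instead of apply. This is a swapped-in variant, not the answer's exact code: how apply treats the grouping column appears to differ between pandas versions, while iterating over the groupby yields complete sub-frames. The random_state value is an arbitrary assumption added only for reproducibility.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "BDF", "CCC", "XYZ", "DEF"],
    "Group_Id": [1, 1, 1, 2, 2, 3],
})

n = 2

# Sample n rows from groups larger than n; keep smaller groups whole.
parts = [
    group.sample(n, random_state=0) if len(group) > n else group
    for _, group in df.groupby("Group_Id")
]

# Stitch the per-group pieces back into one frame with a fresh index.
result = pd.concat(parts, ignore_index=True)
print(result)
```

Drop random_state if a different draw is wanted on every run.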

You can also shuffle the whole DataFrame and then take groupby().head():

df.sample(frac=1).groupby('Group_Id').head(2)

Output (one possible result, since the shuffle is random):

  Name  Group_Id
5  DEF         3
0  AAA         1
2  BDF         1
3  CCC         2
4  XYZ         2
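
If the draw should be repeatable and the rows should come back in their original order, the shuffle can be seeded and the result sorted by index. A minimal sketch, assuming a seed of 42 (an arbitrary choice) and illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "BDF", "CCC", "XYZ", "DEF"],
    "Group_Id": [1, 1, 1, 2, 2, 3],
})

# Shuffle all rows once; random_state makes the shuffle repeatable.
shuffled = df.sample(frac=1, random_state=42)

# head(2) keeps at most two rows per group and never raises on small groups.
# The original index labels survive, so sort_index() restores the input order.
picked = shuffled.groupby("Group_Id").head(2).sort_index()
print(picked)
```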

You can shuffle each subgroup and take the first n rows. Slicing does not raise when a group has fewer than n rows, so each group contributes min(n, group size) rows automatically.

n = 2
df2 = df.groupby('Group_Id').apply(lambda x: x.sample(frac=1)[:n]).reset_index(drop=True)
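
The sketch below checks the slicing behaviour on the one-row group and then shows an equivalent way to cap the sample size explicitly with min(n, len(group)). The loop-and-concat form and the variable names are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["AAA", "ABC", "BDF", "CCC", "XYZ", "DEF"],
    "Group_Id": [1, 1, 1, 2, 2, 3],
})

n = 2

# Slicing past the end of a shuffled one-row group simply returns that row.
single = df[df["Group_Id"] == 3]
print(len(single.sample(frac=1)[:n]))  # 1

# Same effect, stated explicitly: never ask sample() for more rows than exist.
capped = pd.concat(
    [group.sample(min(n, len(group))) for _, group in df.groupby("Group_Id")],
    ignore_index=True,
)
print(capped)
```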
      
