
How to remove *some* rows based on a given condition pandas/python

I am working with a dataset in Pandas and I want to remove some rows based on a given condition. One column in my dataset holds the number of comorbidities a participant has; the possible values are 0, 1, 2, and 3. The dataset has roughly 1 million rows (and 30 other columns): about 500k participants with 0 comorbidities, about 300k with 1 comorbidity, about 130k with 2 comorbidities, and about 75k with 3 comorbidities. I want to randomly drop groups of participants based on their comorbidities value, for example drop 200k with 0 comorbidities and 100k with 1 comorbidity. I know that if I wanted to drop all participants with a given number of comorbidities, for example all participants with 0 comorbidities, I could do the following:

dataframe = allpart, column name = CM

allpart.drop(allpart[allpart['CM'] == 0].index, inplace = True) 

How could I change this so that it would randomly select 300k rows with 0 comorbidities? My data frame is not ordered by that column, so dropping a contiguous chunk of rows is ruled out, and I am not sure that would be random enough anyway. I also want to mention that I will not be using this to draw any legitimate conclusions; it is solely for my own interest.

Thank you!
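One direct adaptation of the drop shown in the question (a sketch, assuming the same allpart DataFrame and 'CM' column) is to sample a fixed number of the matching index labels and drop only those rows:

# Randomly pick 200k of the rows with 0 comorbidities and drop just those
rows_to_drop = allpart[allpart['CM'] == 0].sample(n=200_000, random_state=0).index
allpart = allpart.drop(rows_to_drop)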

One solution would be to define how many rows you want to keep for each comorbidity and then use groupby + sample to select a random subset of that size from each group.

I added a small check in case you specify a number of rows that is larger than the number of rows that exist in your DataFrame for that 'CM' group. In that case it just returns all rows for the group.

import pandas as pd
import numpy as np
np.random.seed(410112)

df = pd.DataFrame({'id': range(20), 'CM': np.random.choice([0,1,2,3,4], 20)})
# Keys are the comorbidity values, values are the number of rows to keep
d = {0: 1, 1: 3, 2: 2, 3: 20, 4: 2}

l = []
for idx, gp in df.groupby('CM'):
    try:
        gp = gp.sample(n=d[idx], replace=False)
    # If we request more rows than the group has, keep the whole group
    except ValueError:
        pass 
    l.append(gp)
    
df1 = pd.concat(l)

    id  CM
3    3   0
17  17   1
13  13   1
5    5   1
19  19   2
7    7   2
1    1   3
4    4   3
10  10   3
12  12   4
0    0   4
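If you would rather avoid the try/except, an equivalent sketch is to cap the requested size at the group size with min(), so sample never raises:

l = []
for idx, gp in df.groupby('CM'):
    # Never request more rows than the group actually has
    l.append(gp.sample(n=min(d[idx], len(gp)), replace=False))
df1 = pd.concat(l)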

A similar alternative that doesn't require reconstructing the entire DataFrame (so is likely faster) is to again specify a dictionary d of the number of rows to keep, use sample(frac=1) to shuffle the DataFrame, and then use groupby + cumcount to build a mask that keeps a random subset of rows per group.

# Keys are the comorbidity values, values are the number of rows to keep
d = {0: 1, 1: 3, 2: 2, 3: 20, 4: 2}

# Shuffle, number the rows within each CM group in that shuffled order, and
# keep rows whose within-group position is below the count requested for that group
mask = df.sample(frac=1).groupby('CM', sort=False).cumcount().lt(df['CM'].map(d))
df1 = df[mask]

# Different subset of rows but still 1 row with CM0, 3 with CM1, ...

    id  CM
9    9   0
5    5   1
15  15   1
17  17   1
6    6   2
7    7   2
1    1   3
4    4   3
10  10   3
0    0   4
12  12   4
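Either way, a quick check (just a usage sketch) that the per-group counts come out as requested, capped at each group's actual size:

# Number of kept rows per CM value
print(df1['CM'].value_counts().sort_index())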
