Efficiently Drop Rows in a Pandas Dataframe

I have a dataset like:

    Id   Status

    1     0
    1     0
    1     0
    1     0
    1     1
    2     0
    1     0 # --> gets removed since this row appears after id 1 already had a status of 1
    2     0
    3     0
    3     0

I want to drop every row of an Id that appears after that Id's status has become 1, i.e. my new dataset should be:

    Id   Status

    1     0
    1     0
    1     0
    1     0
    1     1
    2     0
    2     0
    3     0
    3     0

I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.

The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:

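def remove(group):
    # Reset to a positional index so the label of the first 1 is also its position.
    indexless = group.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        # Keep everything up to and including the first row with Status == 1.
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)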

However, this runs very slowly. Is there a way to fix it, or another way to speed up the computation?

The first idea is to build a cumulative sum per group from a boolean mask. A shift is also needed so that the first 1 itself is not dropped:

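#pandas 0.24+
#group_keys=False keeps the original index, so s aligns with df in newer pandas versions
s = (df['Status'] == 1).groupby(df['Id'], group_keys=False).apply(lambda x: x.shift(fill_value=0).cumsum())
#pandas below
#s = (df['Status'] == 1).groupby(df['Id'], group_keys=False).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]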
print (df)
   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0
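
The same "count the 1s seen on earlier rows" idea can probably be written without `apply` at all, which tends to matter on large data. The following is only a sketch under that assumption, using the question's sample data; the names `is_one`, `ones_so_far`, `ones_before` and `result` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# 1 where Status == 1, else 0 (cast to int so the arithmetic below is unambiguous).
is_one = df['Status'].eq(1).astype(int)
# Running count of 1s per Id, up to and including the current row.
ones_so_far = is_one.groupby(df['Id']).cumsum()
# Subtract the current row's own flag: what remains counts 1s on earlier rows only.
ones_before = ones_so_far - is_one
# Keep rows for which the Id has not had a 1 on any earlier row.
result = df[ones_before.eq(0)]
print(result)
```

Subtracting the row's own flag plays the same role as the shift above: the first row with a 1 is kept, and everything after it for that Id is dropped.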

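Another solution is to use a custom function with Series.idxmax :

def f(x):
    if x['new'].any():
        # idxmax returns an index label, not a position, so slice by label
        # with .loc (end inclusive); this assumes the index labels are unique
        return x.loc[:x['new'].idxmax()]
    else:
        return x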

df1 = (df.assign(new=(df['Status'] == 1))
        .groupby(df['Id'], group_keys=False)
        .apply(f).drop('new', axis=1))
print (df1)
    Id  Status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    2       0
8    2       0
9    3       0
10   3       0

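Or a slightly modified version of the first solution: filter only the groups that contain a 1 and apply the solution only there:

m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]

#select rows whose Id has at least one 1 (test against ids, not the mask m)
m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df['Id'], group_keys=False)
            .apply(lambda x: x.shift(fill_value=0).cumsum())
            .eq(0))

#rows of Ids without any 1 are missing from m2, so keep them with fill_value=True
df = df[m2.reindex(df.index, fill_value=True)]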
print (df)
    Id  Status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    2       0
8    2       0
9    3       0
10   3       0

Let's start with this dataset.

import numpy as np
import pandas as pd

l = [[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0],[2,0],[2,1],[3,0],[2,0],[3,0]]
df_ = pd.DataFrame(l, columns = ['id', 'status'])

We will find the index of the status=1 row for each id.

status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')

    index
id  
1   4
2   8

Now we join df_ with status_1_indice on id:

join_table  = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)

Notice the .fillna(np.inf) for ids that don't have a status=1 row. Result:

    level_0  id  status     index
0         0   1       0  4.000000
1         1   1       0  4.000000
2         2   1       0  4.000000
3         3   1       0  4.000000
4         4   1       1  4.000000
5         5   2       0  8.000000
6         6   1       0  4.000000
7         7   2       0  8.000000
8         8   2       1  8.000000
9         9   3       0       inf
10       10   2       0  8.000000
11       11   3       0       inf

The required dataframe can then be obtained by:

join_table.query('level_0 <= index')[['id', 'status']]

Together:

status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table  = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]


   id   status
0   1   0
1   1   0
2   1   0
3   1   0
4   1   1
5   2   0
7   2   0
8   2   1
9   3   0
11  3   0
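
One caveat with the join: if an id has more than one status=1 row, the join would repeat that id's rows once per match. A possible variant, shown only as a sketch, takes the first status=1 position per id and maps it back onto the rows; `pos`, `first_one` and `cutoff` are illustrative names, not part of the approach above:

```python
import numpy as np
import pandas as pd

rows = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [2, 0], [1, 0],
        [2, 0], [2, 1], [3, 0], [2, 0], [3, 0]]
df_ = pd.DataFrame(rows, columns=['id', 'status'])

# Position of every row, so the comparison does not depend on the index labels.
pos = pd.Series(np.arange(len(df_)), index=df_.index)
has_one = df_['status'].eq(1)
# Position of the first status == 1 row for each id that has one.
first_one = pos[has_one].groupby(df_.loc[has_one, 'id']).min()
# Broadcast the cutoff to every row; ids without a 1 get +inf, so all rows stay.
cutoff = df_['id'].map(first_one).fillna(np.inf)
required_df = df_[pos <= cutoff]
print(required_df)
```

On the sample data this keeps the same rows as the join-based version.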

I can't vouch for the performance, but this is more straightforward than the method in the question.
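
Since the question mentions a 200 GB+ dataset, the frame will likely not fit in memory at once. One option, sketched here under the assumption that the rows arrive in their original order and can be read in chunks (for example with `pd.read_csv(..., chunksize=...)`), is to remember which ids have already had a 1. The function name `drop_after_first_one` and the chunk size are hypothetical:

```python
import pandas as pd

def drop_after_first_one(chunks, id_col='Id', status_col='Status'):
    """Yield each chunk without the rows that follow an id's first 1."""
    closed = set()  # ids whose first 1 was seen in an earlier chunk
    for chunk in chunks:
        # Rows of ids that are already closed can be dropped outright.
        chunk = chunk[~chunk[id_col].isin(closed)]
        is_one = chunk[status_col].eq(1).astype(int)
        # Number of 1s on earlier rows of the same id within this chunk.
        ones_before = is_one.groupby(chunk[id_col]).cumsum() - is_one
        # Remember the ids that got a 1 in this chunk for later chunks.
        closed.update(chunk.loc[is_one.eq(1), id_col].unique().tolist())
        yield chunk[ones_before.eq(0)]

# Usage on the question's sample data, split into in-memory chunks of 4 rows.
df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})
chunks = (df.iloc[i:i + 4] for i in range(0, len(df), 4))
result = pd.concat(drop_after_first_one(chunks))
print(result)
```

Only the set of closed ids is carried between chunks, so memory use depends on the chunk size and the number of distinct ids rather than on the total number of rows.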
