Efficiently Drop Rows in a Pandas Dataframe

Question

I have a dataset like:

    Id   Status

    1     0
    1     0
    1     0
    1     0
    1     1
    2     0
    1     0 # --> gets removed since this row appears after id 1 already had a status of 1
    2     0
    3     0
    3     0

I want to drop every row of an id that appears after that id's status has become 1, i.e., my new dataset should be:

    Id   Status

    1     0
    1     0
    1     0
    1     0
    1     1
    2     0
    2     0
    3     0
    3     0

I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.

The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:

def remove(group):
    # Renumber rows 0..n-1 so the label of the first 1 is also its position
    indexless = group.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        # Keep everything up to and including the first 1
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)

However, this runs very slowly. Is there a way to fix this, or some other way to speed up the computation?

Answer 1

The first idea is to build a cumulative sum per group from a boolean mask. The mask has to be shifted first, otherwise the first 1 itself would be lost:

# pandas 0.24+
s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift(fill_value=0).cumsum())
# older pandas versions
# s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print(df)
   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0
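
The same mask can also be built without apply, because the number of 1s strictly before a row equals the group-wise cumulative sum minus the row's own flag. A minimal sketch of that idea on the question's sample data; the names hit and seen_before are illustrative and not part of the answer above:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# 1 where Status == 1, else 0.
hit = df["Status"].eq(1).astype("int64")

# Per Id: count of 1s seen strictly before the current row.
# cumsum() includes the current row, so subtract the row's own flag.
seen_before = hit.groupby(df["Id"]).cumsum() - hit

# Keep rows that have no earlier 1 within their Id.
result = df[seen_before.eq(0)]
print(result)
```

Since GroupBy.cumsum is a built-in aggregation, this avoids calling a Python lambda once per group, which tends to matter when there are many distinct ids.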

Another solution is to use a custom function with Series.idxmax:

def f(x):
    if x['new'].any():
        # idxmax returns the label of the first True; .loc slices by label, end included
        return x.loc[:x['new'].idxmax(), :]
    else:
        return x

df1 = (df.assign(new=(df['Status'] == 1))
        .groupby(df['Id'], group_keys=False)
        .apply(f).drop('new', axis=1))
print (df1)
   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0
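
One detail worth keeping in mind with idxmax: it returns an index label, not a position inside the group, so the two only coincide when a group's rows happen to start at label 0 and are contiguous. A small sketch of a position-based variant, using np.argmax on the group's flags; the helper name keep_through_first_one and the sample data are made up for illustration:

```python
import numpy as np
import pandas as pd

def keep_through_first_one(group):
    """Return one Id group's rows up to and including its first Status == 1."""
    flags = group["Status"].eq(1).to_numpy()
    if not flags.any():
        return group                     # no 1 in this group: keep everything
    first_pos = int(np.argmax(flags))    # position (not label) of the first True
    return group.iloc[: first_pos + 1]

# Illustrative data in which the groups are interleaved, so labels != positions.
df = pd.DataFrame({
    "Id":     [1, 1, 2, 1, 2, 2, 3],
    "Status": [0, 1, 0, 0, 1, 0, 0],
})

# Trim each group separately, then restore the original row order.
pieces = [keep_through_first_one(g) for _, g in df.groupby("Id", sort=False)]
trimmed = pd.concat(pieces).sort_index()
print(trimmed)
```

Here Id 2 has its first 1 at label 4 but at position 1 within the group, which is the case where mixing labels with iloc would keep too many rows.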

Or a slightly modified version of the first solution: select only the groups that contain a 1 and apply the solution just to them:

m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]

m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df['Id'])
            .apply(lambda x: x.shift(fill_value=0).cumsum())
            .eq(0))

df = df[m2.reindex(df.index, fill_value=True)]
print (df)
   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0

Answer 2

Let's start with this dataset.

import numpy as np
import pandas as pd
l = [[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0],[2,0],[2,1],[3,0],[2,0],[3,0]]
df_ = pd.DataFrame(l, columns=['id', 'status'])

We will find the status=1 index for each id.

status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')

    index
id  
1   4
2   8

Now we join df_ with status_1_indice on id:

join_table  = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)

Note the .fillna(np.inf) for ids that never have status=1. Result:

    level_0  id  status     index
0         0   1       0  4.000000
1         1   1       0  4.000000
2         2   1       0  4.000000
3         3   1       0  4.000000
4         4   1       1  4.000000
5         5   2       0  8.000000
6         6   1       0  4.000000
7         7   2       0  8.000000
8         8   2       1  8.000000
9         9   3       0       inf
10       10   2       0  8.000000
11       11   3       0       inf

The required dataframe can then be obtained with:

join_table.query('level_0 <= index')[['id', 'status']]

Together:

status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table  = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]


   id   status
0   1   0
1   1   0
2   1   0
3   1   0
4   1   1
5   2   0
7   2   0
8   2   1
9   3   0
11  3   0
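
The join relies on each id having at most one status=1 row; if an id can reach 1 more than once in the raw data, the lookup table gets duplicate keys and the join repeats rows. A hedged sketch of a variant that takes the first status=1 index per id and uses Series.map instead of a join. It assumes the default increasing integer index, and the names hits, first_hit and cutoff as well as the sample rows are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative data: id 1 reaches status 1 twice (rows 1 and 3).
rows = [[1, 0], [1, 1], [2, 0], [1, 1], [2, 1], [3, 0], [2, 0], [1, 0]]
df_ = pd.DataFrame(rows, columns=["id", "status"])

# Index of the FIRST status == 1 row for each id that has one.
hits = df_[df_["status"] == 1]
first_hit = pd.Series(hits.index, index=hits["id"].to_numpy()).groupby(level=0).min()

# Per-row cutoff: the first-hit index of the row's id, or +inf if the id never hits 1.
cutoff = df_["id"].map(first_hit).fillna(np.inf).to_numpy()

# Keep rows located at or before their id's cutoff.
required_df = df_[df_.index.to_numpy() <= cutoff]
print(required_df)
```

Like the join, this compares row labels with the cutoff, so it only makes sense while the index reflects row order.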

I can't vouch for the performance, but this is more straightforward than the method in the question.
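
Since the question mentions 200 GB+ of data, the frame may not fit in memory at all. One possible approach, sketched here under the assumption that the data lives in a CSV file whose row order is the order that matters, is to process it in chunks and remember which ids have already reached status 1. The function name drop_after_first_one, the in-memory CSV text and the chunk size are placeholders for illustration:

```python
import io
import pandas as pd

def drop_after_first_one(chunks):
    """Yield filtered chunks, remembering across chunks which Ids already hit Status == 1."""
    finished = set()  # Ids whose first 1 appeared in an earlier chunk
    for chunk in chunks:
        # Rows of already-finished Ids come after their first 1 by definition.
        chunk = chunk[~chunk["Id"].isin(finished)]
        hit = chunk["Status"].eq(1).astype("int64")
        # Count of 1s strictly before each row within its Id, inside this chunk.
        seen_before = hit.groupby(chunk["Id"]).cumsum() - hit
        yield chunk[seen_before.eq(0)]
        finished.update(chunk.loc[hit.eq(1), "Id"].unique())

# Stand-in for a large file: the question's sample data as CSV text.
csv_text = "Id,Status\n1,0\n1,0\n1,0\n1,0\n1,1\n2,0\n1,0\n2,0\n3,0\n3,0\n"
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

result = pd.concat(list(drop_after_first_one(reader)))
print(result)
```

For a real file the filtered chunks would more likely be appended to an output file than concatenated in memory. The finished set grows with the number of ids that reach 1, so this only helps if that set fits in memory.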
