Reversed cumulative sum of a column in pandas.DataFrame

I've got a pandas DataFrame with a boolean column, sorted by another column, and I need to calculate the reverse cumulative sum of the boolean column, that is, the number of True values from the current row to the bottom.

Example

In [13]: df = pd.DataFrame({'A': [True] * 3 + [False] * 5, 'B': np.random.rand(8) })

In [15]: df = df.sort_values('B')

In [16]: df
Out[16]:
       A         B
6  False  0.037710
2   True  0.315414
4  False  0.332480
7  False  0.445505
3  False  0.580156
1   True  0.741551
5  False  0.796944
0   True  0.817563

I need something that will give me a new column with values

3
3
2
2
2
2
1
1

That is, for each row it should contain the number of True values in this row and the rows below it.

I've tried various methods using .iloc[::-1], but the result is not what I want.

It looks like I'm missing some obvious bit of information. I only started using pandas yesterday.

Reverse column A, take the cumsum, then reverse again:

df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]

import pandas as pd
df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True],
     'B': [0.03771, 0.315414, 0.33248, 0.445505, 0.580156, 0.741551, 0.796944, 0.817563],},
     index=[6, 2, 4, 7, 3, 1, 5, 0])
df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]
print(df)

yields

       A         B  C
6  False  0.037710  3
2   True  0.315414  3
4  False  0.332480  2
7  False  0.445505  2
3  False  0.580156  2
1   True  0.741551  2
5  False  0.796944  1
0   True  0.817563  1
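As a quick sanity check, the reversed cumsum can be compared with a naive count of True values from each row to the bottom. This is only a minimal sketch: the names `fast` and `naive` are made up for illustration, and `.iloc[::-1]` is used in place of `.loc[::-1, 'A']`, which should be equivalent here.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])

# Reverse, accumulate, reverse back (same idea as above, written with iloc).
fast = df['A'].iloc[::-1].cumsum().iloc[::-1]

# Naive reference: for each position, count the True values from there to the end.
values = df['A'].tolist()
naive = [sum(values[i:]) for i in range(len(values))]

print(fast.tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
print(naive)          # [3, 3, 2, 2, 2, 2, 1, 1]
```

The naive version is quadratic in the number of rows, so it is only useful for checking small inputs.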

Alternatively, you could count the number of True values in column A and subtract the (shifted) cumsum:

In [113]: df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
Out[113]: 
6    3
2    3
4    2
7    2
3    2
1    2
5    1
0    1
Name: A, dtype: object

But this is significantly slower. Using IPython to perform the benchmark:

In [116]: df = pd.DataFrame({'A':np.random.randint(2, size=10**5).astype(bool)})

In [117]: %timeit df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
10 loops, best of 3: 19.8 ms per loop

In [118]: %timeit df.loc[::-1, 'A'].cumsum()[::-1]
1000 loops, best of 3: 701 µs per loop
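The subtraction idea can also be written without the shift, since "total minus the True values strictly above this row" equals "total minus cumsum plus the current value". The sketch below is not from the original answer and makes no claim about speed. It casts to int first, an assumption made here so that the result stays an integer dtype rather than object.

```python
import pandas as pd

s = pd.Series([False, True, False, False, False, True, False, True],
              index=[6, 2, 4, 7, 3, 1, 5, 0], name='A')

as_int = s.astype(int)

# total - (True values strictly above this row) == total - cumsum + current value
rev = as_int.sum() - as_int.cumsum() + as_int

print(rev.tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
```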

Similar to unutbu's first suggestion, but without the deprecated ix:

df['C']=df.A[::-1].cumsum()
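The second reversal can be dropped because assigning a Series to a DataFrame column aligns on index labels rather than on position. A small sketch of that behaviour follows, assuming the index labels are unique (the variable name `rev` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])

rev = df.A[::-1].cumsum()
print(rev.index.tolist())   # labels are in reversed order: [0, 5, 1, 3, 7, 4, 2, 6]

df['C'] = rev               # assignment matches values to rows by index label
print(df['C'].tolist())     # [3, 3, 2, 2, 2, 2, 1, 1]
```

If the result is needed as a standalone Series in the original row order, reversing it back explicitly (as in the first answer) is still required.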

This works but is slow, like @unutbu's answer. It relies on True resolving to 1; it fails on False or any other value, though.

df[2] = df.groupby('A').cumcount(ascending=False)+1
df[1] = np.where(df['A']==True,df[2],None)
df[1] = df[1].fillna(method='bfill').fillna(0)
del df[2]

#        A         B    1
# 3  False  0.277557  3.0
# 7  False  0.400751  3.0
# 6  False  0.431587  3.0
# 5  False  0.481006  3.0
# 1   True  0.534364  3.0
# 2   True  0.556378  2.0
# 0   True  0.863192  1.0
# 4  False  0.916247  0.0

If you want the reversed cumulative sum along columns:

(-df).cumsum(axis=1).add(1).shift(1,axis=1,fill_value=1.0)
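Note that the expression above computes 1 minus the running total of the columns to the left, so it matches a reversed cumulative sum only when each row sums to 1. For arbitrary numeric data, a more general sketch (not from the original answer) reuses the reverse, cumsum, reverse pattern along axis=1. The column names x, y and z are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 4], 'y': [2, 5], 'z': [3, 6]})

# Reverse the column order, accumulate across columns, then restore the order.
rev = df.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]

print(rev)
#     x   y  z
# 0   6   5  3
# 1  15  11  6
```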
