
Pandas: group rows chained across two columns

I have a dataframe like this:

>>> df = pd.DataFrame([
...     ['a1', None, 1],
...     ['a2', 'a1', 2],
...     ['a3', 'a2', 3],
...     ['b1', None, 9],
...     ['b2', 'b1', 8],
...     ['b3', 'b2', 7],
... ], columns=['key', 'key_prev', 'val'])
>>> df
  key key_prev  val
0  a1     None    1
1  a2       a1    2
2  a3       a2    3
3  b1     None    9
4  b2       b1    8
5  b3       b2    7

Here, key and key_prev are chained. In the above, there are two chains:

a1 -> a2 -> a3
b1 -> b2 -> b3

I'd like to group rows by the chain they belong to. In the above example, I'd like something like:

>>> df.groupby(lambda i: df.iloc[i]['key'][0]).sum()
   val
a    6
b   24

However, key and key_prev can be arbitrary strings, eg:

>>> df = pd.DataFrame([
...     ['a', None, 1],
...     ['c', 'a',  2],
...     ['b', 'c',  3],
...     ['p', 'b',  4],
...     ['r', 'p',  5],
...     ['x', None, 9],
...     ['q', 'x',  8],
...     ['e', 'q',  7],
... ], columns=['key', 'key_prev', 'val'])
>>> df
  key key_prev  val
0   a     None    1
1   c        a    2
2   b        c    3
3   p        b    4
4   r        p    5
5   x     None    9
6   q        x    8
7   e        q    7

In the above, the chains are:

a -> c -> b -> p -> r
x -> q -> e

so the approach above of taking the first letter as the grouping criterion doesn't work.

I can manually iterate the rows and assign a group to each row, then group:

>>> km = dict()
>>> for i, r in df.iterrows():
...     df.at[i, 'grp'] = km[r['key']] = km.get(r['key_prev'], r['key'])
...
>>> df.groupby('grp').sum()
     val
grp
a     15
x     24

but I was wondering if there's a better approach.

EDIT: Note that the rows are not necessarily consecutive, i.e. groups can be intertwined, for example:

df = pd.DataFrame([
    ['a', None, 1], # group a
    ['x', None, 9], # group x
    ['c', 'a',  2], # group a
    ['q', 'x',  8], # group x
    ['b', 'c',  3], # group a
    ['p', 'b',  4], # group a
    ['e', 'q',  7], # group x
    ['r', 'p',  5], # group a
], columns=['key', 'key_prev', 'val'])
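
For reference, the manual loop above should still produce the expected groups on this intertwined frame, since within each chain a predecessor row still appears before its successor (quick check, assuming the frame above is bound to df):

km = dict()
for i, r in df.iterrows():
    df.at[i, 'grp'] = km[r['key']] = km.get(r['key_prev'], r['key'])

df.groupby('grp')['val'].sum()   # select 'val' explicitly to keep the sum numeric
# grp
# a    15
# x    24
# Name: val, dtype: int64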

We can try using isnull with cumsum to create the group key:

out = df.groupby(df.key_prev.isnull().cumsum()).agg({'key':'first','val':'sum'})
Out[309]: 
         key  val
key_prev         
1          a   15
2          x   24
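
Note that the cumsum trick assumes each chain occupies a consecutive block of rows, so it would mis-group the intertwined frame from the edit. A pandas-only sketch for that case (not part of the original answer, assuming df holds the intertwined frame) is to walk every key back to its chain head and group on that:

prev = df.set_index('key')['key_prev']     # key -> its predecessor
root = df['key']
while True:
    step = root.map(prev)                  # one hop up each chain
    if step.isna().all():                  # every key has reached a head (key_prev is null)
        break
    root = step.where(step.notna(), root)  # keep the head once reached

df.groupby(root.rename('grp'))['val'].sum()
# grp
# a    15
# x    24
# Name: val, dtype: int64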
