简体   繁体   中英

How to apply a function only on selected rows and columns of pandas data frame?

I have a following data frame:

       id        subid        a
    1  1         1            2 
    2  1         1            10 
    3  1         1            20
    4  1         2            30
    5  1         2            35 
    6  1         2            36 
    7  1         2            40
    8  2         2            20
    9  2         2            29
    10 2         2            30

I want to apply say for example pandas diff() function on column "a", but the function should be reapplied whenever either "id" or "subid" is being changed, and want to store the values in a new column.

Below is the df I expect.

        id        subid        a      difference
    1  1         1            2       NaN
    2  1         1            10      8
    3  1         1            20      10
    4  1         2            30      NaN
    5  1         2            35      5
    6  1         2            36      1
    7  1         2            40      4
    8  2         2            20      NaN
    9  2         2            29      9
    10 2         2            30      1

As it can be observed at Row-4, and Row-8 either "id" or "subid" is changing, so NaN values are present and diff is calculated in successive rows.

Have used

    df["difference"] = df["a"].diff()

which is obviously applied to the whole column, and not the way expected. I have tried using groupby, but it's somehow giving extra rows.

Thanks for any suggestions in advance.

try this:

In [97]: df['difference'] = df.groupby(['id','subid'])['a'].diff()

In [98]: df
Out[98]:
    id  subid   a  difference
1    1      1   2         NaN
2    1      1  10         8.0
3    1      1  20        10.0
4    1      2  30         NaN
5    1      2  35         5.0
6    1      2  36         1.0
7    1      2  40         4.0
8    2      1  20         NaN
9    2      1  29         9.0
10   2      1  30         1.0

This is a tricky one. According to your exact wording, you want to reset at every point in which either 'id' or 'subid' change. That means even if they change back and forth.

Also, the diff calculation doesn't make a difference if done within a groupby context, so I'll calculate it up front and mask when things change.

i = df.id.values
s = df.subid.values
i_chg = np.append(False, i[:-1] != i[1:])
s_chg = np.append(False, s[:-1] != s[1:])

df.assign(difference=df.a.diff().mask(i_chg | s_chg))

    id  subid   a  difference
1    1      1   2         NaN
2    1      1  10         8.0
3    1      1  20        10.0
4    1      2  30         NaN
5    1      2  35         5.0
6    1      2  36         1.0
7    1      2  40         4.0
8    2      1  20         NaN
9    2      1  29         9.0
10   2      1  30         1.0

Setup

df = pd.DataFrame({'a': {1: 2, 2: 10, 3: 20, 4: 30, 5: 35, 6: 36, 7: 40, 8: 20, 9: 29, 10: 30},
 'id': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2},
 'subid': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1}})

Solution

#Check for each row if the id-subid pair has changed with previous row and then calculate diff accordingly    
df['difference'] = df.apply(lambda x: x.a - df.ix[x.name-1].a 
  if (x.name>1 and x[['id','subid']].equals(df.ix[x.name-1][['id','subid']])) 
  else np.nan, axis=1)

df
Out[368]: 
     a  id  subid  difference
1    2   1      1         NaN
2   10   1      1         8.0
3   20   1      1        10.0
4   30   1      2         NaN
5   35   1      2         5.0
6   36   1      2         1.0
7   40   1      2         4.0
8   20   2      1         NaN
9   29   2      1         9.0
10  30   2      1         1.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM