I have a following data frame:
id subid a
1 1 1 2
2 1 1 10
3 1 1 20
4 1 2 30
5 1 2 35
6 1 2 36
7 1 2 40
8 2 2 20
9 2 2 29
10 2 2 30
I want to apply say for example pandas diff() function on column "a", but the function should be reapplied whenever either "id" or "subid" is being changed, and want to store the values in a new column.
Below is the df I expect.
id subid a difference
1 1 1 2 NaN
2 1 1 10 8
3 1 1 20 10
4 1 2 30 NaN
5 1 2 35 5
6 1 2 36 1
7 1 2 40 4
8 2 2 20 NaN
9 2 2 29 9
10 2 2 30 1
As it can be observed at Row-4, and Row-8 either "id" or "subid" is changing, so NaN values are present and diff is calculated in successive rows.
Have used
df["difference"] = df["a"].diff()
which is obviously applied to the whole column, and not the way expected. I have tried using groupby, but it's somehow giving extra rows.
Thanks for any suggestions in advance.
try this:
In [97]: df['difference'] = df.groupby(['id','subid'])['a'].diff()
In [98]: df
Out[98]:
id subid a difference
1 1 1 2 NaN
2 1 1 10 8.0
3 1 1 20 10.0
4 1 2 30 NaN
5 1 2 35 5.0
6 1 2 36 1.0
7 1 2 40 4.0
8 2 1 20 NaN
9 2 1 29 9.0
10 2 1 30 1.0
This is a tricky one. According to your exact wording, you want to reset at every point in which either 'id'
or 'subid'
change. That means even if they change back and forth.
Also, the diff
calculation doesn't make a difference if done within a groupby
context, so I'll calculate it up front and mask when things change.
i = df.id.values
s = df.subid.values
i_chg = np.append(False, i[:-1] != i[1:])
s_chg = np.append(False, s[:-1] != s[1:])
df.assign(difference=df.a.diff().mask(i_chg | s_chg))
id subid a difference
1 1 1 2 NaN
2 1 1 10 8.0
3 1 1 20 10.0
4 1 2 30 NaN
5 1 2 35 5.0
6 1 2 36 1.0
7 1 2 40 4.0
8 2 1 20 NaN
9 2 1 29 9.0
10 2 1 30 1.0
Setup
df = pd.DataFrame({'a': {1: 2, 2: 10, 3: 20, 4: 30, 5: 35, 6: 36, 7: 40, 8: 20, 9: 29, 10: 30},
'id': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2},
'subid': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1}})
Solution
#Check for each row if the id-subid pair has changed with previous row and then calculate diff accordingly
df['difference'] = df.apply(lambda x: x.a - df.ix[x.name-1].a
if (x.name>1 and x[['id','subid']].equals(df.ix[x.name-1][['id','subid']]))
else np.nan, axis=1)
df
Out[368]:
a id subid difference
1 2 1 1 NaN
2 10 1 1 8.0
3 20 1 1 10.0
4 30 1 2 NaN
5 35 1 2 5.0
6 36 1 2 1.0
7 40 1 2 4.0
8 20 2 1 NaN
9 29 2 1 9.0
10 30 2 1 1.0
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.