How to calculate difference on grouped df?

name      date      value
 a      1/1/2011      3
 b      1/1/2011      5
 c      1/1/2011      7
 a      1/2/2011      6
 b      1/2/2011      10
 c      1/2/2011      14

I have a df here where the value column holds cumulative stats. So the actual value for name: a on date: 1/2/2011 is 3, not 6. To get the actual value for a particular day, I need to take that day's value minus the previous day's value. I want to calculate the actual value of each name for each date. Something along the lines of df.groupby(['name', 'date'])['value'].diff(), but this code returns an error.

In the end, what I need is:

name      date   actual value
 a      1/1/2011      3
 b      1/1/2011      5
 c      1/1/2011      7
 a      1/2/2011      3
 b      1/2/2011      5
 c      1/2/2011      7

This can be done in a single line and in a vectorized way.

import pandas as pd

df = pd.read_clipboard() # Reading from your question

df['value'] = df.groupby('name')['value'].diff(1).fillna(df['value'])

As discussed in the comments, it is necessary to reference the original 'value' Series when applying fillna, so that the NaN values produced by diff are replaced correctly (a NaN occurs for the first instance of each label in 'name').
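To check the one-liner without relying on the clipboard, here is a minimal self-contained sketch. It rebuilds the question's sample frame by hand and writes the result to a separate column; both the literal construction and the `actual` column name are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data from the question, typed out by hand instead of read from the clipboard.
df = pd.DataFrame({
    "name": ["a", "b", "c", "a", "b", "c"],
    "date": ["1/1/2011"] * 3 + ["1/2/2011"] * 3,
    "value": [3, 5, 7, 6, 10, 14],
})

# Difference between consecutive rows within each name.
# The first row of every name has no predecessor, so it comes back as NaN.
per_name_diff = df.groupby("name")["value"].diff()

# Fill those NaNs with the original cumulative value (aligned on the index),
# then cast back to int because NaN forced the column to float.
df["actual"] = per_name_diff.fillna(df["value"]).astype(int)

print(df)
```

Filling with `df['value']` rather than 0 is what keeps the first day's figure: on the first date the cumulative total is itself the actual value.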

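# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
df['value'] = df['value'].ffill()
# diff() depends on row order, so sort within each name first
df = df.sort_values(by=['name', 'date'])
# The first row of each name has no previous value; keep its cumulative value there
df['actual'] = df.groupby('name')['value'].diff().fillna(df['value'])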
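The sort in this second approach matters because diff subtracts the previous row in the frame's current order, and sorting date strings is not guaranteed to be chronological. The sketch below illustrates this on a deliberately shuffled copy of the question's data. The shuffled frame is hypothetical, and the month/day/year reading of the dates is an assumption, since the question does not state the format.

```python
import pandas as pd

# Hypothetical, deliberately shuffled copy of the question's data.
df = pd.DataFrame({
    "name": ["a", "c", "b", "b", "a", "c"],
    "date": ["1/2/2011", "1/1/2011", "1/2/2011", "1/1/2011", "1/1/2011", "1/2/2011"],
    "value": [6, 7, 10, 5, 3, 14],
})

# Assumption: dates are month/day/year. Parsing makes the sort chronological
# rather than alphabetical.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Order rows by name, then by date, so each row's predecessor is the previous day.
df = df.sort_values(["name", "date"]).reset_index(drop=True)

# Per-name difference; first rows keep their cumulative value.
df["actual"] = df.groupby("name")["value"].diff().fillna(df["value"]).astype(int)

print(df)
```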
