Pandas groupby + resample/TimeGrouper for change over months from start
I have a dataframe of employee salary data (sample as follows) where 'Date' refers to when the employee's salary became effective:
Employee Date Salary
PersonA 1/1/2016 $50000
PersonB 3/5/2014 $65000
PersonB 3/1/2015 $75000
PersonB 3/1/2016 $100000
PersonC 5/15/2010 $75000
PersonC 6/3/2011 $100000
PersonC 3/10/2012 $110000
PersonC 9/5/2012 $130000
PersonC 3/1/2013 $150000
PersonC 3/1/2014 $200000
In this example, PersonA started this year at $50,000, and PersonC has been with the company for a while and has received several increases since starting on 5/15/2010.
I need to convert the Date column to Months from Start, on an individual employee basis, where Months from Start will be in increments of m months (specified by me). For example, for PersonB, assuming m=12, the result would be:
Employee Months From Start Salary
PersonB 0 $65000
PersonB 12 $65000
PersonB 24 $75000
This means that at month 0 (employment start), PersonB had a salary of $65,000; 12 months later his salary was $65,000, and 24 months later his salary was $75,000. Note that the next increment (36 months) would NOT appear on the transformed dataframe for PersonB because that duration exceeds the duration of PersonB's employment (it would be in the future).
Note again that I want to be able to adjust m to any month increment. If I wanted increments of 6 months (m=6), the result would be:
Employee Months From Start Salary
PersonB 0 $65000
PersonB 6 $65000
PersonB 12 $65000
PersonB 18 $75000
PersonB 24 $100000
PersonB 30 $100000
As a final step, I would also like to include the employee's salary as of today on the transformed dataframe. Using PersonB again, and assuming m=6, this means that the results would be:
Employee Months From Start Salary
PersonB 0 $65000
PersonB 6 $65000
PersonB 12 $65000
PersonB 18 $75000
PersonB 24 $100000
PersonB 30 $100000
PersonB 32.92 $100000 <--added (today is 32.92 months from start)
Question: is there a programmatic way (I assume using at least one of: groupby, resample, or TimeGrouper) to achieve the desired dataframe described above?
Note: you can assume all employees are active (have not left the company).
You can combine groupby and resample to do it. To use resample, you need to have the date as the index.
df.index = pd.to_datetime(df.Date)
df.drop('Date',axis = 1, inplace = True)
Then:
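# Forward-fill the salary onto 6-month (month-end) bins per employee.
# ffill() is the current name for pad(); recent pandas spells the '6M' alias as '6ME'.
df = df.groupby('Employee')['Salary'].resample('6M').ffill().reset_index()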
In this case, I'm using 6-month periods. Notice that each period is labelled with the last day of its month; I hope that's not going to be a problem. Then you will have:
Employee Date Salary
0 PersonA 2016-01-31 $50000
1 PersonB 2014-03-31 $65000
2 PersonB 2014-09-30 $65000
3 PersonB 2015-03-31 $75000
4 PersonB 2015-09-30 $75000
5 PersonB 2016-03-31 $100000
6 PersonC 2010-05-31 $75000
7 PersonC 2010-11-30 $75000
8 PersonC 2011-05-31 $75000
9 PersonC 2011-11-30 $100000
10 PersonC 2012-05-31 $110000
11 PersonC 2012-11-30 $130000
12 PersonC 2013-05-31 $150000
13 PersonC 2013-11-30 $150000
14 PersonC 2014-05-31 $200000
Now you can create the "months since started" column (the cumcount function numbers each row by its position within its group, starting at 0). Remember to multiply it by the number of months you're using for each period (in this case, 6):
df['Months since started'] = df.groupby('Employee').cumcount()*6
Employee Date Salary Months since started
0 PersonA 2016-01-31 $50000 0
1 PersonB 2014-03-31 $65000 0
2 PersonB 2014-09-30 $65000 6
3 PersonB 2015-03-31 $75000 12
4 PersonB 2015-09-30 $75000 18
5 PersonB 2016-03-31 $100000 24
6 PersonC 2010-05-31 $75000 0
7 PersonC 2010-11-30 $75000 6
8 PersonC 2011-05-31 $75000 12
9 PersonC 2011-11-30 $100000 18
10 PersonC 2012-05-31 $110000 24
11 PersonC 2012-11-30 $130000 30
12 PersonC 2013-05-31 $150000 36
13 PersonC 2013-11-30 $150000 42
14 PersonC 2014-05-31 $200000 48
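To make the period length adjustable, the two steps above could be wrapped in a function that takes m. This is only a sketch: the function name resample_salaries and the small PersonB frame are made up for illustration, and pd.offsets.MonthEnd(m) is used in place of the '6M' string so that the bin width follows m. The months are still counted from month-end bins, not from the exact start date.

```python
import pandas as pd

# Small illustrative frame (PersonB only), with Salary as plain integers.
df = pd.DataFrame({
    "Employee": ["PersonB", "PersonB", "PersonB"],
    "Date": pd.to_datetime(["2014-03-05", "2015-03-01", "2016-03-01"]),
    "Salary": [65000, 75000, 100000],
})


def resample_salaries(frame, m):
    """Hypothetical helper: forward-fill salaries onto m-month bins per employee."""
    out = (
        frame.set_index("Date")
        .groupby("Employee")["Salary"]
        .resample(pd.offsets.MonthEnd(m))  # m-month bins labelled by month end
        .ffill()                           # carry the last known salary forward
        .reset_index()
    )
    # Position within each employee's group, scaled to months.
    out["Months since started"] = out.groupby("Employee").cumcount() * m
    return out


result = resample_salaries(df, 6)
print(result)
```

Calling resample_salaries(df, 12) would give yearly steps instead.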
Hope it helped!
You can use the groupby and merge functionalities of DataFrames:
>>> import pandas as pd
>>> df = pd.DataFrame([['PersonC','5/15/2010',75000],['PersonC','7/3/2011',100000],['PersonB','3/5/2014',65000],['PersonB','3/1/2015',75000],['PersonB','3/1/2016',100000]],columns=['Employee','Date','Salary'])
>>> df['Date']= pd.to_datetime(df['Date'])
>>> df
Employee Date Salary
0 PersonC 2010-05-15 75000
1 PersonC 2011-07-03 100000
2 PersonB 2014-03-05 65000
3 PersonB 2015-03-01 75000
4 PersonB 2016-03-01 100000
>>> # Earliest date per employee, as a two-column frame (Employee, Start Date)
>>> start_date = df.groupby('Employee')['Date'].min().rename('Start Date').reset_index()
>>> df = df.merge(start_date, how='left', on='Employee')
>>> # Days since start, then approximate months (30 days each) floored to a multiple of 6
>>> df['Months From Start'] = (df['Date'] - df['Start Date']).dt.days
>>> df['Months From Start'] = df['Months From Start'].apply(lambda x: (x // 30) - (x // 30) % 6)
>>> df
Employee Date Salary Start Date Months From Start
0 PersonC 2010-05-15 75000 2010-05-15 0
1 PersonC 2011-07-03 100000 2010-05-15 12
2 PersonB 2014-03-05 65000 2014-03-05 0
3 PersonB 2015-03-01 75000 2014-03-05 12
4 PersonB 2016-03-01 100000 2014-03-05 24
Here you can replace 6 with a variable called m and assign arbitrary values to it.
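Dividing by 30 only approximates a month. As an alternative sketch (not part of the answer above), whole calendar months can be counted from the year, month and day fields. The variable names here are illustrative. Note that this can give a smaller number than the days/30 version when a date falls a few days short of an anniversary, as PersonB's 3/1 dates do relative to a 3/5 start.

```python
import pandas as pd

df = pd.DataFrame(
    [["PersonC", "5/15/2010", 75000], ["PersonC", "7/3/2011", 100000],
     ["PersonB", "3/5/2014", 65000], ["PersonB", "3/1/2015", 75000],
     ["PersonB", "3/1/2016", 100000]],
    columns=["Employee", "Date", "Salary"],
)
df["Date"] = pd.to_datetime(df["Date"])

m = 6  # increment in months

# Each employee's earliest date, aligned row by row with df.
start = df.groupby("Employee")["Date"].transform("min")

# Whole calendar months elapsed since the start date.
months = (df["Date"].dt.year - start.dt.year) * 12 + (df["Date"].dt.month - start.dt.month)
# A month only counts once the start day-of-month has been reached.
months = months - (df["Date"].dt.day < start.dt.day).astype(int)

# Floor to a multiple of m.
df["Months From Start"] = months // m * m
print(df)
```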
OK, so for the first part of the answer I would do something like this...
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Employee': ['PersonA', 'PersonB', 'PersonB', 'PersonB', 'PersonC', 'PersonC', 'PersonC', 'PersonC', 'PersonC', 'PersonC'],
'Date': ['1/1/2016', '3/5/2014', '3/1/2015', '3/1/2016', '5/15/2010', '6/3/2011', '3/10/2012', '9/5/2012', '3/1/2013', '3/1/2014'],
'Salary': [50000 , 65000 , 75000 , 100000 , 75000 , 100000 , 110000 , 130000 , 150000 , 200000]
})
df.Date = pd.to_datetime(df.Date)
m = 6
emp_groups = df.groupby('Employee')
df['months_from_start'] = df.Date - emp_groups.Date.transform('min')
df.months_from_start = df.months_from_start.dt.days / 30 // m * m
m can be whatever you want it to be. I am calculating the days since each employee's min date, then dividing by the approximate number of days in a month, and then doing a little bit of integer division to "round off" to the window size you want.
This will give you something like this...
Date Employee Salary months_from_start
0 2016-01-01 PersonA 50000 0
1 2014-03-05 PersonB 65000 0
2 2015-03-01 PersonB 75000 12
3 2016-03-01 PersonB 100000 24
4 2010-05-15 PersonC 75000 0
5 2011-06-03 PersonC 100000 12
6 2012-03-10 PersonC 110000 18
7 2012-09-05 PersonC 130000 24
8 2013-03-01 PersonC 150000 30
9 2014-03-01 PersonC 200000 42
The second part is a little tricky. I would create a new df and concat it to the first...
last_date_df = emp_groups.last()
last_date_df.months_from_start = (last_date_df.Date - emp_groups.first().Date).dt.days / 30
last_date_df.reset_index(inplace=True)
pd.concat([df, last_date_df], axis=0)
getting you...
Date Employee Salary months_from_start
0 2016-01-01 PersonA 50000 0.000000
1 2014-03-05 PersonB 65000 0.000000
2 2015-03-01 PersonB 75000 12.000000
3 2016-03-01 PersonB 100000 24.000000
4 2010-05-15 PersonC 75000 0.000000
5 2011-06-03 PersonC 100000 12.000000
6 2012-03-10 PersonC 110000 18.000000
7 2012-09-05 PersonC 130000 24.000000
8 2013-03-01 PersonC 150000 30.000000
9 2014-03-01 PersonC 200000 42.000000
0 2016-01-01 PersonA 50000 0.000000
1 2016-03-01 PersonB 100000 24.233333
2 2014-03-01 PersonC 200000 46.200000
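The extra rows above measure up to each employee's last salary date, not to today. A minimal sketch of an "as of today" row follows, assuming a fixed as_of timestamp so the result is reproducible (in practice pd.Timestamp.today() could be used instead). The names as_of and today_rows are illustrative, and months are again approximated as 30 days.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["PersonB", "PersonB", "PersonB"],
    "Date": pd.to_datetime(["2014-03-05", "2015-03-01", "2016-03-01"]),
    "Salary": [65000, 75000, 100000],
})

# Assumed reference date standing in for "today".
as_of = pd.Timestamp("2016-12-01")

grouped = df.sort_values("Date").groupby("Employee")

# Latest salary record per employee.
today_rows = grouped.last().reset_index()

# Start date per employee, looked up by employee name.
start = grouped["Date"].min()
elapsed = as_of - today_rows["Employee"].map(start)

# Approximate months from start to the reference date.
today_rows["months_from_start"] = (elapsed.dt.days / 30).round(2)
today_rows["Date"] = as_of
print(today_rows)
```

These rows could then be appended to the main frame with pd.concat, as in the answer above.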
Many thanks for the provided answers. Unfortunately, all answers are a little 'off' and didn't quite achieve the goal. I ended up nesting two for loops within list comprehensions to achieve the goal.
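The looping code was not posted, so the following is not the asker's solution. It is one possible sketch: build a grid of dates every m months from each employee's start up to an assumed as_of date, then use pd.merge_asof to pick the salary in effect on each grid date. The function name salary_grid and the as_of value are hypothetical. Because it looks up the salary actually in effect on each date, its output may differ from the hand-built tables in the question.

```python
import pandas as pd


def salary_grid(df, m, as_of):
    """Hypothetical helper: salary in effect every m months from each start date."""
    df = df.sort_values("Date")
    rows = []
    for emp, g in df.groupby("Employee"):
        start = g["Date"].iloc[0]  # earliest date, since df is sorted
        k = 0
        # Step forward m calendar months at a time until passing as_of.
        while start + pd.DateOffset(months=k * m) <= as_of:
            rows.append({
                "Employee": emp,
                "Months From Start": k * m,
                "Date": start + pd.DateOffset(months=k * m),
            })
            k += 1
    grid = pd.DataFrame(rows).sort_values("Date")
    # For each grid date, take the latest salary row on or before it.
    out = pd.merge_asof(grid, df, on="Date", by="Employee")
    return out.sort_values(["Employee", "Months From Start"]).reset_index(drop=True)


df = pd.DataFrame({
    "Employee": ["PersonB", "PersonB", "PersonB"],
    "Date": pd.to_datetime(["2014-03-05", "2015-03-01", "2016-03-01"]),
    "Salary": [65000, 75000, 100000],
})
out = salary_grid(df, 6, pd.Timestamp("2016-12-01"))
print(out)
```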